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Abstract 

We study the Bayesian model averaging approach to learning Bayesian network structures (DAGs) 
from data. We develop new algorithms including the first algorithm that is able to efficiently sample 
DAGs according to the exact structure posterior. The DAG samples can then be used to construct 
estimators for the posterior of any feature. We theoretically prove good properties of our estimators 
and empirically show that our estimators considerably outperform the estimators from the previous 
state-of-the-art methods. 
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1. Introduction 


Bayesian networks are graphical representations of multivariate joint probability distributions and 
have been widely used in various data mining tasks for probabilistic inference and causal model¬ 
ing ( |Pearl[ 2000} Spirtes et al.[ 20011. The core of a Bayesian network (BN) representation is its 
Bayesian network structure. A Bayesian network structure is a DAG (directed acyclic graph) whose 
nodes represent the random variables Xi , X 2 , • • • > in the problem domain and whose edges 
correspond to the direct probabilistic dependencies. Semantically, a Bayesian network structure G 
encodes a set of conditional independence assumptions: for each variable (node) Xi, Xi is condi¬ 
tionally independent of its non-descendants given its parents. With the above semantics, a Bayesian 
network structure provides a compact representation for joint distributions and supports efficient 
algorithms for answering probabilistic queries. Furthermore, with its semantics, a Bayesian net¬ 
work structure can often provide a deep insight into the problem domain and open the door to the 
cause-and-effect analysis. 

In the last two decades, there have been a large number of researches focusing on the problem 
of learning Bayesian network structure(s) from the data. These researches deal with a common 
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real situation where the underlying Bayesian network is typically unknown so that it has to be 
learned from the observed data. One motivation for the structure learning is to use the learned 
structure for inference or decision making. For example, we can use the learned model to predict or 
classify a new instance of data. Another structure-learning motivation, which is more closely related 
to the semantics of Bayesian network structures, is for discovering the structure of the problem 
domain. For example, in the context of biological expression data, the discovery of the causal 
and dependence relation among different genes is often of primary interests. With the semantics 
of a Bayesian network structure G, the existence of an edge from node X to node y in G can 
be interpreted as the fact that variable X directly influences variable Y ; and the existence of a 
directed path from node X to node Y can be interpreted as the fact that X eventually influences Y. 
Furthermore, under certain assumptions ( Heckerman et al.[ 1999[ Spirtes et al.[ 20011, the existence 
of a directed path from node X to node Y indicates that X causes Y. Thus, with the learned 
Bayesian network structure, we can answer interesting questions such as whether gene X controls 
gene Y which in turn controls gene Z by examining whether there is a directed path from node X 
via node Y to node Z in the learned structure. Just as mentioned by Friedman and Koller ( 2003| l, the 
extraction of these kinds of interesting structural features is often the primary goal in the discovery 
task. 

There are several general approaches to learning BN structures. One approach is to treat it as a 
model selection problem. This approach defines a scoring criterion that measures how well a BN 
structure (DAG) fits the data and finds the DAG (or a set of equivalent DAGs) with the optimal score 


(Silander and Myllymaki |2006 

Jaakkola et al. 

2010| Yuan et al.| 20111 Malone et al.| 2011a|b 

Cussens 201 ItlYuan and Malonel|2012t|Malone and Yuan 2013 Cussens and Bartlettl|2013t|Yuan| 

and Malone 2013 

1. (In Bayesian approach, the score of a DAG G is simply the posterior p{G\D) 


of G given data D.) However, when the data size is small relative to the number of variables, the 
posterior p(G|ZJ) often gives significant support to a number of DAGs, and using a single maximum- 
a-posteriori (MAP) model could lead to unwarranted conclusions (Friedman and KolIer| 2003 1). It is 


therefore desirable to use the Bayesian model averaging approach by which the posterior probability 


of any feature of interest is computed by averaging over all the possible DAGs (Heckerman et al.| 

T^. 


Bayesian model averaging is, however, computationally challenging because the number of 
possible network structures is between and super-exponential in the number 

of variables n. Tractable algorithms have been developed for special cases of averaging over trees 
( Meila and JaakkoTa| 20061 and averaging over DAGs given a node ordering ( Dash and Cooper] 
20041. Since 2004, dynamic programming (DP) algorithms have been developed for computing 


exact posterior probabilities of structural features such as edges or subnetworks (Koivisto and Sood 


2004 |Koivisto| 2006| Tian and He[ 20091. These algorithms have exponential time and space 
complexity and are capable of handling Bayesian networks of moderate size with up to around 25 
variables (mainly due to their space cost 0(n2'^)). A big limitation of these algorithms is that they 
can only compute posteriors of modular features such as an edge but can not compute non-modular 
features such as a path (“is there a path from node X to node Y”), a combined path (“is there 
a path from node X via node Y to node Z” or “is there a path from node X to node Y and no 
path from node X to node Z”), or a limited-length path (“is there a path of length at most 3 from 
node X to node Y”). Recently, Parviainen and Koivisto| (2011 1 have developed a DP algorithm 
that can compute the exact posterior probability of a path feature under a certain assumption. (The 
assumption, called order-modular assumption, will be discussed in details soon.) This DP algorithm 
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has (even higher) exponential time and space complexity and can only handle a Bayesian network 
with fewer than 20 variables (mainly due to its space cost 0(3"^)). Since this DP algorithm can only 
deal with a path feature, all the other non-modular features (such as a combined path) which various 
users would be interested in for their corresponding problems still can not be computed by any DP 
algorithm proposed so far. Note that generally the posterior p{f\D) of a combined feature / = 
(/i) / 2 ) • • • 5 fj) can not be obtained only from the posterior of each individual feature p{fj\D) (j G 
{1,2,...,J}) because the independence among these features does not hold generally. Actually, 
by comparing p{f 2 \fi,D) with p{f 2 \D), a user can know the effect of the feature /i upon the 
feature / 2 ; but to obtain p{f 2 \fi, D) (= p{fi,f 2 \D)/p{fi\D)), the user typically needs to obtain 
p(/i) f 2 \D) first. Another limitation of all these DP algorithms is that it is very expensive for them 
to perform data prediction tasks. They can compute the exact posterior of a new observational data 
case p{x\D) but the algorithms have to be re-run for each new data case x. 


One solution to computing the posterior of an arbitrary non-modular feature is drawing DAG 
samples {Gi,..., Gt} from the posterior p{G\D), which can then be used to approximate the 
full Bayesian model averaging by estimating the posterior of an arbitrary feature / as p{f\D) 

T Y^=i f{Gi), or the posterior predictive distribution as p{x\D) k, ^ Ym=i viAGi)- A number of 


algorithms have been developed for drawing sample DAGs using the bootstrap technique (Friedman 
et al.[ 19991 or the Markov Chain Monte Carlo (MCMC) techniques (Madigan and York 1995| 


Friedman and Koller 2003; Eaton and Murphy] 2007[[Grzegorczyk and Husmeier[ 2008[ Niinimaki 

et al.[ 201 1[ Niinimaki and Koivisto[ 20131. Madigan and York ( 1995| l developed the Structure 
MCMC algorithm that uses the Metropolis-Hastings algorithm in the space of DAGs. Friedman and 


Koller (20031 developed the Order MCMC procedure that operates in the space of orders. The Order 


MCMC was shown to be able to considerably improve over the Structure MCMC the mixing and 


convergence of the Markov chain and to outperform the bootstrap approach of Friedman et al. (19991 


as well. Eaton and Murphy| (2007 1 developed the Hybrid MCMC method (i.e., DP-fMCMC method) 
that first runs the DP algorithm of |Koivisto| (|2006|) to develop a global proposal distribution and 


then runs the MCMC phase in the DAG space. Their experiments showed that the Hybrid MCMC 
converged faster than both the Structure MCMC and the Order MCMC so that the Hybrid MCMC 
resulted in more accurate structure learning performance. An improved MCMC algorithm (often 
denoted as REV-MCMC ) traversing in the DAG space with the addition of a new edge reversal move 


was developed by Grzegorczyk and Husmeier (20081 and was shown to be superior to the Structure 
MCMC and nearly as efficient as the Order MCMC in the mixing and convergence. Recently, 
Niinimaki et al. ( 2011[ l have proposed the Partial Order MCMC method which operates in the space 
of partial orders. The Partial Order MCMC includes the Order MCMC as its special case (by setting 
the parameter bucket size 6 to be 1) and has been shown to be superior to the Order MCMC in terms 
of the mixing and the structural learning performance when a more appropriate bucket size 6 > 1 is 
set. One common drawback of these MCMC algorithms is that there is no guarantee on the quality 
of the approximation in finite runs. The approach to approximating full Bayesian model averaging 


using the iT-best Bayesian network structures was studied by Tian et al. (2010) and was shown to 
be competitive with the Hybrid MCMC. 


Several of these state-of-the-art algorithms work in the order space, including the exact algo¬ 
rithms (Koivisto and Sood 2004 Koivisto[ 2006| Parviainen and Koivisto[ 201 1|) and the approxi¬ 


mate algorithms: the Order MCMC (Eriedman and Koller 20031 and the Partial Order MCMC (Ni- 


inimaki et al. 

2011 

1. They all assume a special form of the structure prior, termed as order-modular 

prior ( 

Eriedman anc 

[Kollerj 2003 

Koivisto and Sood 

20041, for computational convenience. (Please 
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refer to the beginning of Section [ZT] for the definition of the order-modular prior.) However, the as¬ 
sumption of order-modular prior has the consequence that the corresponding prior p{G) cannot 
represent some desirable priors such as a uniform prior over the DAG space; and the computed 
posterior probabilities are biased since a DAG that has a larger number of topological orders will 
be assigned a larger prior probability. Whether a computed posterior with the bias from the order- 
modular prior is inferior to its counterpart without such a bias depends on the application scenario 
and is beyond the scope of this paper. For the detailed discussion about this issue, please see 


fhe relafed papers ( 

Eriedman and Koller [ 2003 [ Grzegorczyk and Husmeier 

2008 Parviainen and 

Koivisfo 

20111. One mefhod fhaf helps fhe Order MCMC (Eriedman and Koller 

2003 

1 fo correcf 


In this paper, first we develop a new algorithm that uses the results of the DP algorithm of 
Koivisto and Sood ( 2004|) to efficiently sample orders according to the exact order posterior under 


the assumption of order-modular prior. Next, we develop a time-saving strategy for the process 
of sampling DAGs consistent with given orders. (Such a DAG-sampling process is based on sam¬ 
pling parents for each node as described by Friedman and KolIer| ( 2003] l by assuming a bounded 
node in-degree.) The resulting algorithm (called DDS) is the first algorithm that is able to sample 
DAGs according to the exact DAG posterior with the same order-modular prior assumption. We em¬ 
pirically show that our DDS algorithm is both considerably more accurate and considerably more 
efficient than the Order MCMC and the Partial Order MCMC when n is moderate so that our DDS 
algorithm is applicable. Moreover, the estimator based on our DDS algorithm has several desirable 
properties; for example, unlike the existing MCMC algorithms, the quality of our estimator can be 
guaranteed by controlling the number of DAGs sampled by our DDS algorithm. The main appli¬ 
cation of our DDS algorithm is to address the limitation of the exact DP algorithms ([Koivisto and 


Sood 2004 [Koivistoj 2006[ Parviainen and Koivistoj 201 Ij) (whose usage is restricted to modular 


features or path features) in order to estimate the posteriors of various non-modular features arbi¬ 
trarily specified by users. Additionally our DDS algorifhm can also be used fo efficienlly perform 
dafa prediction fasks in estimating p{x\D) for a large number of dafa cases (while fhe exacf DP 
algorifhm has fo be re-run for each case x). Finally, we develop an algorifhm (called IW-DDS) fo 
correcf fhe bias (due fo fhe order-modular prior) in fhe DDS algorifhm by exfending fhe idea of 


Ellis and Wong (20081. We Iheorefically prove fhaf fhe esfimafor based on our IW-DDS has sev¬ 


eral desirable properties; fhen we empirically show fhaf our esfimafor is superior fo fhe esfimafors 
based on fhe Hybrid MCMC mefhod (jEafon and Murph^[2007 1 and fhe iF-besf algorifhm (Tian 


ef al. 20101, fwo slale-of-fhe-arf algorifhms fhaf can esfimafe fhe posferior of any fealure wifhouf 


fhe order-modular prior assumpfion. Analogously, our IW-DDS algorifhm mainly addresses fhe 
limifafion of fhe exacf DP algorifhm of Tian and He ( [2009 1 (whose usage is resfricfed fo modular 
fealures) in order fo esfimafe fhe posferiors of arbifrary non-modular fealures and can additionally 
be used fo efficienfly perform dafa prediction fasks when an applicafion sifuafion prefers fo avoid 
fhe bias from order-modular prior. 


The resf of fhe paper is organized as follows. In Secfion we briefly review fhe Bayesian 
approach fo learning Bayesian nefworks from dafa, fhe relafed DP algorifhms ( [Koivisto and Sood] 
) and fhe Order MCMC algorifhm ( [Eriedman and Koller[[2003) . In Secfion 
we presenf our order sampling algorifhm, DDS algorifhm, and IW-DDS algorifhm; and prove good 
properties of fhe esfimafors based on our algorifhms. We empirically demonsfrafe fhe advanfages of 
our algorifhms in Secfion]^ Secfionj^concludes fhe paper. Einally, Appendix [^provides fhe proofs 
of all fhe conclusions including fhe propositions, fheorems and corollary referenced in fhe paper. 


2004 Koivisfo[ 2006 


4 

















































































Structure Learning in BNs of Moderate Size by Efficient Sampling 


2. Bayesian Learning of Bayesian Networks 


A Bayesian network is a DAG G that encodes a joint probability disttibution over a set X = 
{Xi,..., Xn] of random variables with each node of the DAG representing a variable in X. For 
convenience we typically work on the index set 1/ = {1,..., n} and represent a variable Xj by its 
index i. We use Xpa^ C X to represent the parent set of X* in a DAG G and use Pat C 1/ to 
represent the corresponding index set. (A pair (i, Poj) is often called a family.) Thus, a DAG G can 
be represented as a vector (Poi,..., Pan)- 

Assume that we are given a training data set D = {x ^, ,..., x”*}, where each x* is a particular 

instantiation over the set of variables X. We only consider situations where the data are complete, 
that is, every variable in X is assigned a value. In the Bayesian approach to learning Bayesian 
networks from the training data D, we compute the posterior probability of a DAG G as 

^ P{D\G)p{G) ^ p{D\G)p{G) 

^ P{D) T.GP{D\G)p{Gy 

Assuming global and local parameter independence, and parameter modularity, p{D\G) can be 
decomposed into a product of local marginal likelihoods (often called local scores) as ( [Cooper and 
|Herskovitsl|1992trHeckerman et al.[|1995| l 

n n 

p(oiG)=n p{^Xj}^Xpn^ . D) .— SCOTCii^Pa-i . f9), (1) 

i=l i=l 


where, with appropriate parameter priors, scoreyPat : D) (the local score for a family {i,Pai)) 
has a closed form solution. In this paper we will assume that these local scores can be computed 
efficiently from data. The standard assumption for structure prior p{G) is structure-modular prior 
( [Friedman and Kollerj 2003 1: 


P{G) = YipiiPai), 


( 2 ) 


2=1 


where pi is some nonnegative function over the subsets of 1/ — {i}. 

Combing Eq. ([^ and Eq. (|^, we have 

n 

p^{G, D) = p{D\G)p{G) = scoreyPai : D)pi{Pai). (3) 

2=1 

Note that the subscript y is intentionally added by us to mean that the corresponding probability 
is the one obtained without order-modular prior assumption. This is different from the probability 
computed with order-modular prior assumption, which will be marked by the subscript ^ for the 
distinction. 

We can compute the posterior probability of any hypothesis of interest by averaging over all 
the possible DAGs. Eor example, we are often interested in computing the posteriors of structural 
features. Eet / be a structural feature represented by an indicator function such that f{G) is 1 if the 
feature is present in G and 0 otherwise. By the full Bayesian model averaging, we have the posterior 
of / as 

p{f\D) = Y,f{G)p{G\D). (4) 

G 
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Note that p^(/|D) will be obtained \fp{G\D) in Eq. Q is p^{G\D)\p^{f\D) will be obtained 
if p{G\D) in Eq. Q is p^{G\D). This difference is the key to understanding the bias issue which 
will be described in details later. 

Since summing over all the possible DAGs is generally infeasible for any problem with n > 6 
using a contemporary computer, one approach to computing the posterior of / is to draw DAG sam¬ 
ples {Gi,..., Gt} from the posterior p^{G\D) or p^{G\D), which can then be used to estimate 
the posterior p^(/|L)) ovp^{f\D) as 

1 ^ 

p{f\D) = -Y,fm. ( 5 ) 

^=1 


2.1 The DP Algorithms 


The DP algorithms (Koivisto and Sood 2004 Koivisto 20061 work in the order space rather than 
the DAG space. We define an order -< of variables as a total order (a linear order) on V represented 
as a vector {Ui,... ,Un), where Ui is the set of predecessors of i in the order To be more clear 
we may use U^. We say that a DAG G = {Pa \,..., Pan) is consistent with an order {Ui ,..., Un), 
denoted by G C-<, if Pai C Ui for each i. If 5 is a subset of V, we let jC(S) denote the set of linear 
orders on S. In the following we will largely follow the notation from [Koivisto (20061. 

The algorithms working in the order space assume order-modular prior defined as follows: if G 
is consistent with then 


P{P,G) = '[{qiiUi)piiPai) 


( 6 ) 


2=1 


where each qi and pi is some function from the subsets of 1/ — {i} to the nonnegative real numbers. 
(If G is not consistent with then p(-<, G) = 0.) 

A modular feature is defined as: 


n 

f{G) = Wh{Pa,), 

2=1 

where fi{Pai) is an indicator function returning a 0/1 value. Eor example, an edge feature j i 
can be represented by setting fi{Pai) = 1 if and only if j G Pai and setting fi{Pai) = 1 for all 
I i. 

With the order-modular prior, we are interested in the posterior {f\D) = p^ (/, D)/p^ {D). 
p^{f\D) can be obtained if the joint probability p^{f,D) can be computed (since p^{D) = 
p^{f = 1,D) where / = 1, meaning that / always equals 1, can be easily achieved by setting 
each fi{Pai) to be the constant 1). Koivisto and Sood (2004) show that 

n 

p(/,^,ii)=n«i(c/^), (7) 

2=1 

and 

n 

= (8) 

-< i=l 
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where the function ai is defined for each i and each S'Cf/ — {i}as 

a,(5) = (Zi(5) A(Pa,), 


Pai<ZS 


in which the function fii is defined for each i and each Pai C f/ — {i} as 

fii{Pai) = fi{Pai)pi{Pai)scorei{Pai : D). 


Accordingly, the DP algorithm of |Koivisto and Sood| ( [2004] ) consists of the following three steps. 
The first step computes Pi{Pai) for each i ^ V and each Pai C 1/ — {f}. The time complexity of 
this step is 0{n^~^^C{m)) under the assumption of the maximum in-degree k, where n is the number 
of variables, and C{m) is the cost of computing a single local marginal likelihood scorei{Pai : D) 
for m data instances. The second step computes ai{S) for each i and each 5 C 1/ — {f}. With 
the assumed maximum in-degree k, this step takes 0{knT^) time by using the truncated Mdbius 
transform technique (Koivisto and Soo dl 20041 which is extended from the standard fast Mdbius 
transform algorithm (Kennes and Smets 19911. The third step computes p^{f,D) by defining the 


(9) 


following function (called forward contribution) for each S' C f/; 

^(^)= E 

^eC{S) i&s 

where is the set of variables in S ahead of i in the order AG P{S). It can be shown that for every 
S C V the corresponding L{S) can be computed recursively using the DP technique according to 
the following equation ( Koivisto and Sood] |2004| Koivisto[ 2006| ): 

L(S) = ^ai(5-{f})L(5-{f}), (10) 


ieS 


starting with L(0) = 1 and ending with L{V). From Eq. ([^ and Eq. Q, we have 

pM,D) = L{V). 


( 11 ) 


The third step takes 0(n2”) time when L{V) is computed using the above DP technique. In sum¬ 
mary, the whole DP algorithm of Koivisto and Sood] ( |2004 i can compute the posterior of any mod¬ 
ular feature (such as an edge feature) in 0{n^^^C{m) + kn2'^) time and 0{n2^) space. 

As the extended work of Koivisto and Sood] ( [2004[ ), Koivisto| ( 2006| l includes the DP algorithm 
of [Koivisto and Sood (20041 as its first three steps and appends two additional steps so that all 
the n{n — 1) edges can be computed in 0{n^^^C{m) + kn2^) time and 0(n2’^) space. The 
foundation of the two additional steps is the introduction of the following function (called backward 
contribution) for each T PV: 

^(^)= E (12) 

^'£C{T)i&T 

Eike L{S), R{T) can also be computed recursively using some DP technique. Please refer to the 
paper of |Koivist^ ( |2006 1 for further details of the two additional steps. 

While the introduced DP algorithms ( Koivisto and Sood||2004[ Koivisto 20061 make significant 
contributions to the structure learning of Bayesian networks, they have one fundamental limitation: 
they can only compute the posteriors of modular features. In the next section, we will show how to 


use the results of the DP algorithm of Koivisto and Sood (2004i to efficiently draw DAG samples. 


which can then be used to compute the posteriors of arbitrary features. 
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2.2 Order MCMC 

The idea of the Order MCMC is to use the Metropolis-Hastings algorithm to draw order samples 
{-< 1 ,..., that have \D) as the invariant distribution, where No is the number of sampled 
orders. For this purpose we need to be able to compute p{^,D), which can be obtained from Eq. Q 
by setting / = 1. Let /3^{Pai) denote j3i{Pai) resulted from setting each fi{Pai) to be the constant 
1. Similarly, we define a'(S') and L'{S) as the special cases of ai{S) and L{S) respectively by 
setting each fi{Pai) to be the constant 1. Then from Eq. (|7]l and ( [IT] ) we have 

n 

= (13) 

i=l 

and 

p^{D) = L'{V). (14) 


The Order MCMC can estimate the posterior of a modular feature as 


N„ 


pM\d) = 


(15) 


2 = 1 


Eor example, from Propositions 3.1 and 3.2 stated by [Eriedman and Koller (20031 as well as the 
definitions of /3' and a[, the posterior of a particular choice of parent set Pui C for node i given 
an order is 


pi{i,Pa,)\<,D) = ^jjMfM 


and the posterior of the edge feature j -f- i given an order is 


(16) 


pU 


^,D) = l- 




(17) 


In order to compute arbitrary non-modular features, we further draw DAG samples after drawing 
No order samples. Given an order, a DAG can be sampled by drawing parents for each node ac¬ 
cording to Eq. ( [T^ . Given DAG samples {Gi,..., Gt}, we can then estimate any feature posterior 
p^{f\D) using p^{f\D) shown in Eq. ([^. 


3. Order Sampling Algorithm and DAG Sampling Algorithms 

In this section we present our order sampling algorithm, DDS algorithm and IW-DDS algorithm. 
We also prove good properties of the estimators based on our algorithms. 


3.1 Order Sampling Algorithm 

In this subsection, we show that using the results including a^{S) (for each i ^ V and each S C 


V — {f}) and L'{S) (for each S C V) computed from the DP algorithm of Koivisto and Sood 


( 2004[ ), we can draw an order sample efficiently by drawing each element in the order one by one. 
Let an order ^ be represented as (ai ,..., (Jn) where Ui is the ith element in the order. 
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Proposition 1 The conditional probability that the kth (1 < k < n) element in the order is cjfc 
given that the n — k elements after it along the order are cifc+i ,... ,an respectively is as follows: 


Pi(^k\(^k+l T 1 tTrij D) — 


U{Uf 


(18) 


O'fc+l ^ 


where ak € V — {cJfc+i,..., an}, and Uf: = V — {ai, CTj+i,..., an} so that Uf. denotes the set of 
predecessors of ai in the order 
Specifically for k = n, we essentially have 

L'{V-{i})a[{V-{i}) 


p{an = i\D) = 


LfV) 


(19) 


where i £ V. 


Note that all the proofs in this paper are provided in Appendix [A| 

It is clear that for each fc G {1,. .., n}, P{(^k = A^k+i, ■ ■ ■, <^n, D) = 1 because of 

Eq. (101 and — {ak}- Thus, p{ak\ak+i, ■ ■ ■, an, D) is a probability mass function 

(pmfj with k possible a^ values from 

Based on Proposition [T] we propose the following order sampling algorithm to sample an order 

-<: 

• Sample an, the last element of the order according to Eq. ( fTO] ). 

• Eor each k from n — 1 down to 1: given the sampled {au+i, • • ■, an), sample ak, the kth 
element of the order according to Eq. (181. 

Sampling an order using the above algorithm takes only O(n^) time since sampling each ele¬ 
ment fjfc (/c G {1,..., n}) in the order takes 0{n) time. 

The following proposition guarantees the correctness of our order sampling algorithm. 

Proposition 2 An order -< sampled according to our order sampling algorithm has its pmf equal to 
the exact posterior p{^ \D) under the order-modular prior, because 


p{ak\ak+i, ...,an,D)= p[-< \D). 


( 20 ) 


A:=l 


The key to our order sampling algorithm is that we realize that the results including a'fS) 
and L'{S) computed from the DP algorithm of Koivisto and Sood ( 2004| are already sufficient 
to guide the order sampling process. In an abstract point of view, the results computed from the 
DP algorithm of |Koivisto and So^ (2004 1 are analogous to the answers provided by the ^P-oracle 
stated in Theorem 3.3 of |Jerrum et al. ( |1986| l. Theorem 3.3 of Jerrum et'aLl ( |1986| l states that with the 
aid of a #P-oracle that is always able to provide the exact counting information (the exact number) 
of accepting configurations from a currently given configuration, a probabilistic Turing Machine 
can serve as a uniform generator so that every accepting configuration will be reached with an 
equal positive probability. In our situation, instead of providing the exact counting information, the 


results computed from the DP algorithm of Koivisto and Sood (2004i is able to provide the exact 


joint probability p{ak, ak+i, ■ ■ ■, an, D) for a subsequence {ak,ak+i, ■ ■ ■, an) of any order ^ for 
any k G {1, 2,..., n}, which is shown in the proof for Proposition [I] in Appendix |A.1[ As a result, 
the order sampling can be efficiently performed based on the definition of conditional probability 
distribution. 
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3.2 DDS Algorithm 


After drawing an order sample, we ean then easily sample a DAG by drawing parents for each node 
according to Eq. ( fTh] ) as described by Friedman and Koller (20031 (assuming a maximum in-degree 
k). This naturally leads to our algorithm, termed Direct DAG Sampling (DDS), as follows: 


Step 1: Run the DP algorithm of Koivisto and Sood (20041 with each fi{Pai) set to be the 
constant 1. 


• Step 2 (Order Sampling Step): Sample No orders such that each order ^ is independently 
sampled according to our order sampling algorithm. 

• Step 3 (DAG Sampling Step): For each sampled order one DAG is independently sampled 
by drawing a parent set for each node of the DAG according to Eq. ( [T^ . 

The correctness of our DDS algorithm is guaranteed by the following theorem. 


Theorem 3 The No DAGs sampled according to the DDS algorithm are independent and identi¬ 
cally distributed (iid) with the pmf equal to the exact posterior p^{G\D) under the order-modular 
prior. 


The time complexity of the DDS algorithm is as follows. Step 1 takes 0{n^~^^C{m) + /cn2”) 
time ( Koivisto and Sood] 2004| ), which has been discussed in Section [TT] In Step 2, sampling each 
order takes 0(n‘^) time. In Step 3, sampling each DAG takes time. Thus, the overall time 

complexity of our DDS algorithm is 0{n^~^^C{m) + A:n2” + n?No + n^~^^No). Since typically we 
assume A: > 1, the order sampling process (Step 2) does not affect the overall time complexity of 
the DDS algorithm because of its efficiency. 

The time complexity of our DDS algorithm depends on the assumption of the maximum in¬ 


degree k. Such an assumption is fairly innocuous, as discussed on page 101 in the article of Fried- 
man and Koller| (|2003(), because DAGs with very large families tend to have low scores. (The 


maximum-in-degree assumption is also justified in the context of biological expression data on 


page 270 in the article of Grzegorczyk and Husmeier|2008|) Accordingly, this assumption has been 

widely used in the literature (Friedman and Koller 

2003t Koivisto and Sood[ 

2004 

Ellis and Wong| 

2008 Grzegorczyk and Husmeier 

2008 

Niinimaki et al. 

2011 

Parviainen and Koivisto 

20111 and 


the maximum in-degree k has been set to be no greater than 6 in all of their experiments. 

Note that the DAG sampling step of the DDS algorithm takes 0{n^^^No) time. This will 
actually dominate the overall running time of the DDS algorithm (even if k is assumed to be 3 or 4), 
when n is moderate (n < 25) and the sample size No reaches several thousands. Therefore, for the 
efficiency of our DDS algorithm, we have developed a time-saving strategy for the DAG sampling 
step, which will be described in details in Remark 

Given DAG samples, p^{f\D), an estimator for the exact posterior of any arbitrary feature /, 
can be constructed by Eq. 0 Fetting Cnj denote the time cost of determining the structural feature 
/ in a DAG of n nodes, constructing p^{f\D) takes 0{CnjNo) time. (For example, Cnj^ = 0(1) 
for an edge feature f^, and Cnjp = 0{'n?) for a path feature fp.) If we only need order samples, 
the algorithm consisting of Steps 1 and 2 will be called Direct Order Sampling (DOS). Given order 
samples, for some modular feature / such as a parent-set feature or an edge feature, p(/| -<i, D) can 
be computed by Eq. ([T^ or ([Tt]), and then p^ {f\D) can be estimated by Eq. (jlsj). (Since computing 
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a parent-set feature or an edge feature by Eq. ( [T^ or takes 0(1) time , estimating p^{f\D) by 
Eq.@ only takes 0{No) time for these two features.) 

As for the space costs of our DDS algorithm, note that both a total order and a DAG can be 
represented in 0{n) space (since a total order can be represented as a vector {Ui ,..., C/„) and a 
DAG can be represented as a vector (Pai,..., Pan))- Therefore, the overall memory requirement 
of our DDS algorithm is 0{n2'^ + nNo)'. Step 1 of our DDS takes 0(n2") memory space; and Steps 
2 and 3 of our DDS take 0{nNo) memory space. 

Due to Theorem]^ the estimator p_<(/|Zl) based on our DDS algorithm has the following desir¬ 
able properties. 

Corollary 4 For any structural feature f, with respect to the exact posterior p^ {f\D), the estimator 
p^{f\D) based on the Nq DAG samples from the DDS algorithm using Eq. (|^ has the following 
properties: 

(i) p^{f\D) is an unbiased estimator for p^{f\D). 

(ii) p^{f\D) converges almost surely to p^{f\D). 

(Hi) IfO < p^{f\D) < 1, then the random variable 

^o{pM\d)-vM\d)) 

has a limiting standard normal distribution. 

(iv) For any e > 0, any 0 < 5 < 1, if No > (ln(2/(5))/(2e^), thenp{\p^ {f\D)—p^{f\D)\ < e) 
> 1 - 6 . 


In particular, Coroll^ v[^(iv), which is essentially from Hoeffding bound ( Hoeffdin^ 1963 Koller 


and Eriedman 


20091: p{\p^{f\D) — p^{f\D)\ > e) < 2e , states that in order to ensure that 


the probability that the error of the estimator p^{f\D) from the DDS algorithm is bounded by e is 
at least 1 — 5, we just need to require the sample size No > (ln(2/5))/(2e^). This property, which 
the existing MCMC algorithms (Eriedman and Koller 2003| [Niinimaki et al.[ 20111 do not have, 
can be used to obtain quality guarantee for the estimator from our DDS algorithm. 


Remark 5 About our time-saving strategy for the DAG sampling step of the DDS. 

The running time of the DAG sampling step (Step 3) of the DDS algorithm is 0(n^~^^No), which 
will actually dominate the overall running time of the DDS algorithm when n is moderate and the 
sample size No reaches several thousands. Thus, in the following we will introduce our strategy for 
effectively reducing the running time of the DAG sampling step so that the efficiency of the overall 
DDS algorithm can be achieved. 

In the DAG sampling step, each sampled order Aj = (cii,..., (1 < * < No) can be 

represented as {(cti, • • •, {crn, where Uoj denotes the set of predecessors of aj in 

the order. For each sampled order Aj, for each (aj, Uaj)~^'‘ (1 ^ j ^ n), we need to sample one 
PcLaj of aj (one parent set of aj) from a list {Pa^jz}^^ including every parent set Poa-jz G Uff. 

Let Zj be the length of such a list. Since Zj = ~ 0{n^), sampling one Pa^^ 

for aj takes 0{rf) time and sampling one DAG takes 0{n^~^^) time. Note that Zj is actually an 
increasing function of j but in the following we use the notation Z instead of Zj for notational 
convenience when the context is clear. 
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However, for No > 1, the overall running time of the DAG sampling step can be reduced as fol¬ 
lows. Letot"'^"^^ be P{{aj,Paajz)\ Pi,D) = Pfuj, Paa^z)\{crj,Uoj)~^% D) = j3'o^{Pa„.fj / 
[a'^ .iU/ q„-{Uf^)], for z G {1,..., Z}. First, using the common strategy of sampling from 


a discrete distribution {Roller and Friedmam 


2009), for (aj, we can create Sj 




(aj ,U,TA) 


sequence of Z probability intervals with the form of < [0,9\ ), [0^ 

O 2 ^ ),•••, ^ ^ ), ^ ,1) >, where the Ith 

l-l (aj,U,r.pi I Ji^jPcTiPi^ 

tz , — ^ "2 




+ 


interval is [f2z=i^z ^ ) Z^ 2 =i ^ ). Note that ^ can be created in time 0{Z) 

and sampling one Pdaj far aj from a list {Paojz}z^ can then be achieved using binary search in 

time 0{logZ) based on s\ . Then the following observation is the key reason for reducing 

the running time of the DAG sampling step. For two sampled orders -<i and Ai' fl < i,f < Nq), 
even if it is possible that {aj, = {aj, Uajfai' for some j G {1,..., n}. This is 

because for each j, {aj, Ua-) essentially has a multinomial distribution with No trials and a set of 
re(”~j) cell probabilities {P{{aj, Uoj)\D)f Actually, for any j, the following relation holds for 
each cell probability: 


P{{aj, Uo^)\D) oc a'{Uoj)fa{Uo^)R\V - - {afa). 


( 21 ) 


where B!{.) is the special case of R{.) by setting / = 1 and R{.) is defined in Eq. \12\ . Its proof 
is very similar to the derivation shown by \Koivisto\ (|2006p and is provided in Appendix Note 

Thus, by storing the created 


= S) 


S't ■’’ in the memory, once {aj, Uoj)~^^ = {aj, Uofa~^i' for i' > i, creating Sj 


can be 


that {aj,Uoj)^' = {aj,Uoj)^^' implies Sj 
I 

avoided and sampling one PcLaj far aj takes only 0{logZ) time. 

On one hand, our strategy will definitely save the running time for these j’s such that n{jf^') 

(the number of all the possible values of {aj, U^j)) is smaller than Nq if every created s\ 
is stored. This is because the running time of sampling Pa^. of aj is only 0{logZ) in at least 


samples out of the overall No samples. (In the worst case, s\ will be created 

for each possible {aj, Uoj)-) For example, when j = n, the number of all the possible values of 
{aj, Uaj) is only n and Zj (the length of the list {Paajz}^) achieves its maximum among all the 
j’s so that sampling one Paa„ for an takes 0{logZn) time in at least No — n samples. Accordingly, 
when j = n, the worst-case running time of sampling the No {an, Poafa families is 0{n{Zn + 
logZfa + {No — n)logZn) = 0{nZn + NologZfa- On the other hand, our strategy usually can also 
save the running time even for these j’s such that the number of all the possible values of{aj, Uo -) is 
larger than No. This is because the probability mass usually is not uniformly distributed among the 
set of all the possible values of {aj, Ua-). Once the majority of probability mass p^ is concentrated 
on rj {aj, U^fa values, where rj is a number smaller than No, the probability that only these Vj 
{aj, Ua ) values appear in all the No order samples is {pfa^°. Accordingly, with the probability 
{pfa^°, the running time of sampling Pa^j for aj is 0{logZj) in at least No — rj samples. As 
a result, the expected running time of sampling the No {aj, Pa^fa families is below 0{[rjZj + 
NologZj]{pfa^° + No{Zj + logZj){l — {pfa^°)); the expected running time of sampling the No 
DAGs is below 0{Yfa^i{\rjZj + NologZj]{pfa^° + No{Zj + logZj){l — (^ 5 ,)^°)}). Typically, 
when m is not small, local score scorefaPoi : D) will not be uniform at all. Correspondingly, it 
is likely that the multinomial probability mass function P{{aj, Uoj)\D) will concentrate dominant 
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probability mass on a small number of {aj, U^j) candidates that these {aj , Pou.)’s having large 
local scores are consistent with. As a result, our time-saving strategy will usually become more 
effective when m is not small. 

Note that we also include the policy of recycling the created Sj ^ sfor our strategy because 
it is possible that all the memory in a computer will be exhausted in order to store all the created 

^ s, especially when n is not small but m is small. (The space complexity of storing all 

1 IS { J\ '! . - ^ j 


s 


the Sj ^ 


s is 0(J2J=i ^).) For this paper, we use a simple recycling 

method as follows. Some upper limit for the total number of the probability intervals (representing 

oi ) ) is pre-specified based on the memory of the used computer. 

Each time such an upper limit is reached during the DAG sampling step of the DDS, which indi- 

cates a large amount of memory has been used to store Sj ^ s, we recycle the currently stored 

Sj according to their usage frequencies which serve as the estimates of P{{aj, Uaj)\D)s. 

[cTj^Ucr -) 

The memory for each infrequently used Sj ■’ will be reclaimed to ensure that at least a pre- 
specified number of probability intervals will be recycled from the memory. In addition, in order to 

(aj ,Ucr ■ ) 

have a better use of each created Sj ^ before it possibly gets reclaimed, we sort the Nq sam¬ 
pled orders according to the posterior p(^ \D) just before executing the DAG sampling step of the 
DDS. The underlying rationale is that ifp(Ai \D) is relatively close to p{Ai' \D), which indicates 
p{Ai,D) is relatively close to p{Aii, D) (sincep^{D) is a constant), due to Eq. ( [7!^ , it is likely that 
-<i and -<i' share some (aj, U^j) components}. (The extreme situation is that if p(Ai \D) equals 
p(Ai' \D), it is very likely that -<i equals -<i> so that -<i and -<ii share every (aj, Ua-).) Thus, -<i 
and ~<i' having similar posteriors tend to be close to each other after the sorting so that it is likely 
GjP<y) 

that the common Sj ^ will be used before the reclamation. Eurthermore, as Nq increases, the 
probability that two orders (out of the No sampled orders) share some (aj, U^j) component(s) in- 

{(Tj,Ua) 

creases. Accordingly, after the sorting, the probability of reusing Sj ^ before its reclamation 
will also increase. a result, the benefit of our time-saving strategy will typically increase when 
No increases. 

The experimental results show that our time-saving strategy for the DAG sampling step of the 
DDS is very effective. Please see the discussion in Section^about jhiTnAc) tmd d{Tj;)AG)> the 
sample mean and the sample standard deviation of the running time of the DAG sampling step of 
the DDS, which are reported in Tableland Table^ 


3.3 IW-DDS Algorithm 

In this subsection we present our DAG sampling algorithm under the general structure-modular 
prior (Eq. Q) by effectively correcting the bias due to the use of the order-modular prior. 

As mentioned in Section[^ p^{f\D) has the bias due to the assumption of the order-modular 
prior. This is essentially because p^{G\D) based on the order-modular prior (Eq. ([^) is different 
frompy^(G\D) based on the structure-modular prior (Eq. ([^). 

In fact, with the common setting that qi{Ui) always equals 1 {qi{Ui) = 1), if pi{Pai) in Eq. ([^ 
is set to be always equal to pi(Pai) in Eq. ([^ (pi(Pai) = pi(Pai)), the following relation holds: 

Pa{G\D) = • I \-pAG\D), (22) 
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where | | is the number of orders that G is consistent with. (The proof of Eq. (22 1 is given in 

Appendix]^) Accordingly, 

■MD) 


pAm = 'ZAO)pAa\D) = 


Ag 


-MG\D). 


Note thatp^ {D) can be computed by the DP algorithm of Koivisto and Sood ( 2004 ) in 0{n^~^^C{m) + 
kn2^) time, and p^{D) can be computed by the DP algorithm of Tian and He (20091 in 0{n^~^^C{m)+ 
/cn2” + 3") time. Thus, if | -<q. \ is known for each sampled Gi (i £ {1,2,, No}), we can use 
importance sampling to obtain a good estimator 


, No 

p^{f\D) = —^f{G,) 


2=1 


Pa{D) 1 

'p^iD) I -<Gi 


(23) 


where each Gi is sampled from our DDS algorithm. Unfortunately, | Ag^ | is #P hard to compute 
for each Gi (Brightwell and Winkler 1991| l; and the state-of-the-art DP algorithm proposed by |Ni-| 


inimaki and Koivisto 


(|2013 1 for computing | Ag^ | takes 0{n2"') time. Therefore, in the following 


we propose an estimator that can be much more efficiently computed than the estimator shown in 


Eq. ( [23| ). 

Becausep^(/|i9)has the bias with respect to (/1D), a good estimator p_<(/|i9)forp^(/|i9) 
typically is not appropriate to be directly used to estimate p^(/|i9). Noticing this problem, Ellis 


and Wong (20081 propose to correct this bias for the Order MCMC method as follows: first run the 
Order MCMC to draw order samples; then for each unique order A out of the sampled orders, keep 
drawing DAGs consistent with A (but only keep unique DAGs) until the sum of joint probabilities of 
these unique DAGs, ^ • p(A, Gi, D), is no less than a pre-specified large proportion (such as 95%) 
of p{-<,D) = YIgca p(a, G, D); finally the resulting union of all the DAG samples is treated as 
an importance-weighted sample for the structural discovery. 

Inspired by the idea of Ellis and Wong ( 2008[ ), we develop our own bias-correction strategy 
which is computationally more efficient and can theoretically ensure the resulting estimator to have 
desirable properties. (Please refer to Remark]^ for detailed discussion.) Our bias-corrected algo¬ 
rithm, termed IW-DDS (Importance-weighted DDS), is as follows: 


• Step 1 (DDS Step): Run the DDS algorithm to draw No DAG samples with the setting that 

qi{Ui) = 1 and pi{Pai) = pi{Pai). 

• Step 2 (Bias Correction Step): Make the union set Q of all the sampled DAGs by eliminating 
the duplicate DAGs. 


Given Q, p^(/|i9), the estimator for the exact posterior of any feature /, can then be constructed 
as 


where 


p^{f\D) = Y,f{G)MG\D)^ 

G&g 


pAg\d) 


pAg,d) 

'^G&gPy^iG, D) 


(24) 


(25) 


14 






































Structure Learning in BNs of Moderate Size by Efficient Sampling 


and p^{G, D) is given in Eq. (|^. 

Since checking the equality of two DAGs takes 0{n) time by using their vector representations, 
with the usage of a hash table, both the expected time cost and the space cost of the bias correction 
step are 0{nNo). Therefore, the expected time cost of our IW-DDS algorithm is + 

+ ii?No + n^^^No), and the required memory space of our IW-DDS algorithm is 0(n2” + 


nNo). 

Note that when each Gi gets sampled, the corresponding joint probability p^{Gi, D) can be 
easily computed and stored with Gi. Therefore, just as constructing the estimator from the DDS, 
constructing the estimator p^{f\D) from the IW-DDS also takes 0{GnjNo) time, where Gnj 
denotes the time cost of determining the structural feature / in a DAG of n nodes. 

While Ellis and Wong ( |2008 1 show the effectiveness of their method in correcting the bias 
merely by the experiments, we first theoretically prove that our estimator has desirable properties as 
follows. 


Theorem 6 For any structural feature f, with respect to the exact posterior p^ {f\D), the estimator 
p^(/|Z)) based on the DAG samples from the IW-DDS algorithm using Eq. ( |24| ) has the following 
properties: 

(i) p./^{f\D) is an asymptotically unbiased estimator for py,[f\D). 

(ii) Pyi{f\D) converges almost surely to p^(/|E>). 

(Hi) The convergence rate of py^{f\D) is o{a^°) for any 0 < a < 1. 

(iv) Define the quantity A = then 

A • pAf\D) < pAf\D) < A • pAf\D) + 1 - A. (26) 


Note that the introduced quantity A = Ylc&g P-a(^^ A G [0,1] essentially 

represents the cumulative posterior probability mass of the DAGs in Q. Eq. ph] ) provides a sound 
interval [A ■ p^{f\D), A ■ p^{f\D) 1 — A] in which py^(f\D) must reside. (The “sound interval” 

is stronger than the concept of “confidence interval” because there is no probability that p^{f\D) is 
outside the sound interval.) The width of the sound interval is (1 — A), where A is a nondecreasing 
function of No (because if we increase the original No to a larger A' and sample additional A' — No 
DAGs in the DDS step, the resulting G' is always the superset of the original Q). Thus, in the 
situations where m (the number of data instances) is not very small, it is possible for A to approach 
1 by a tractable number No of DAG samples so that a desired small-width interval for p^(f\D) can 
be obtained. (Please refer to Section]^ for the corresponding experimental results.) Also note that 
Eq. (261 can be expressed in the following equivalent form: 


-(1 - A)pAf\D) < pAf\D) - pAf\D) < (1 - A)(l - pAf\D)), 
which gives the bound for the estimation error p^(/|i2) — pAf\D). 


(27) 


Remark 7 A comparison between our bias-correction strategy and the one of Ellis and Wong 


(20081 


Our bias-correction strategy used in the IW-DDS solves the computation problem existing in the 


idea of Ellis and Wong (2008) and ensures the desirable properties of our estimator p^{f\D) stated 
in Theorem^ 

Since in the Order MCMC, sampling an order is much more computationally expensive than 
sampling a DAG given an order, the strategy of\Ellis and Wong (2008) emphasizes making the 
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full use of each sampled order that is, keeping drawing DAGs consistent with each sampled -< 
until the sum of joint probabilities for the unique sampled DAGs, ^)’ than 

a large proportion (such as 95%) ofp{A, D). Unfortunately, such a strategy has a computational 
problem when the number of variables n is not small and the number of data instances m is small. 
Because there are super-exponential number ) of DAGs (with the maximum in-degree k) 

consistent with each order \Friedman and Roller] \2003} , it is possible that a non-negligible portion 
of probability mass p{A,D) will be distributed almost uniformly to a majority of these consistent 
DAGs when m is small. Consequently, Nq, the required number of DAGs sampled per each sampled 
order -<, will be extremely large, leading to large computation costs. For sampling Nq DAGs 
consistent with each sampled order its expected time cost is + nklog{n)NQ) (even 

if a time-saving strategy like the one described in Remark^is used) and its memory requirement 
is 0{nNQ). If the memory requirement exceeds the memory of the running computer, the hard 
disk has to be used to temporarily store the sampled DAGs in some way. (We notice that Ellis 


and Wong {2008) limit their experiments to the data sets with at most 14 variables.) If we take 
the data set “Child” ( Tsamardinos et al\ 2006\ with n = 20 and m = 100/or example, for an 
order -< randomly sampled by our order sampling algorithm, our experiment shows that 1 x 10^ 
DAGs (which contain 932,137 unique DAGs) need to be sampled to let the ratio 'f2iP{A, Gi, D) 
/p{a,D) reach 94.071%; 1.5 x 10^ DAGs (which contain 1,204,262 unique DAGs) need to be 
sampled to let the ratio f^cich 94.952%. To address this problem, based 

on the efficiency of our order sampling algorithm, our strategy samples only one DAG from each 
sampled order in the DDS step so that the large computation costs per each sampled order are 


avoided for any data set. Meanwhile, unlike the strategy of Ellis and Wong {2008), our strategy 
does not delete the duplicate order samples. Therefore, if an order -< gets sampled j (> 1) times 
in the order sampling step, essentially j DAGs will be sampled for such a unique order in the DAG 
sampling step. Thus, j, the number of occurrences, implicitly serves as an importance indicator for 
-< among the orders. 

Eurthermore, the strategy of Ellis and Wong 11(2008 1 can not guarantee that the sampled DAGs 
are independent, even if large computation costs are spent in sampling a huge number of DAGs 
per each sampled order. This is essentially because multiple DAGs sampled from a fixed order 
according to the strategy of Ellis and Wong ( 20Q8| ) are not independent. Eor example, given that a 
DAG G with an edge X ^ Y gets sampled from an order -< , which implies that node X precedes 
node Y in the given order -<, then the conditional probability that any DAG G' with a reverse edge 
Y ^ X gets sampled under the fixed order -< becomes zero, so that G and G' are not independent. 
In general, once the number of sampled orders is fixed, even if the number of sampled DAGs per 
each sampled order keeps increasing, every DAG that is consistent with none of the sampled orders 
will still have no chance of getting sampled. In contrast, the sampling strategy in our IW-DDS is 
able to guarantee the property that all the DAGs sampled from the DDS step are independent, which 
has been stated in Theorem Such a property actually is a key to ensuring the good properties of 
our estimator py^{f\D) stated in Theorem^ 


The competing state-of-the-art algorithms that are also applicable to BNs of moderate size are 
the Hybrid MCMC method ( Eaton and Murph^|2007 1 and the iT-best algorithm ( |Tian et ah 20101. 
The first competing method, the Hybrid MCMC, includes the DP algorithm of [Koivisto (20061 
(with time complexity 0{n’‘~^^G{m)-\- kn2^) and space complexity 0(n2”)) as its first phase and 
then uses the computed posteriors of all the edges to make the global proposal for its second phase 
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(MCMC phase). When its MCMC phase eventually converges, the Hybrid MCMC will correct 
the bias coming from the order-prior assumption and provide DAG samples according to the DAG 
posterior so that the estimator p^{f\D) can be constituted using Eq. (j^ for any feature /. The 
Hybrid MCMC has been empirically shown to converge faster than both the Structure MCMC and 


the Order MCMC so that more accurate structure learning performance can be obtained (Eaton and 
Murphy[ 2007 j ). Note that since the REV-MCMC method ( Grzegorczyk and Husmeier| 20081 is 
shown to be only nearly as efficient as the Order MCMC in the mixing and convergence, the Hybrid 
MCMC is expected to converge faster than the REV-MCMC method as long as n is moderate so 
that the Hybrid MCMC is applicable. (But the REV-MCMC method has its own value in learning 
large BNs since all these methods using some DP technique (including the Hybrid MCMC, the K- 
best algorithm and our IW-DDS method) are infeasible for a large n due to the space cost.) One 
limitation of the Hybrid MCMC is that it can not obtain the interval for p^{f\D) as specified by 
Theorem [^(iv). Additionally, the convergence rate of the estimator from the Hybrid MCMC is not 
theoretically provided by its authors. 

The second competing method, the iT-best algorithm, applies DP technique to obtain a col¬ 
lection Q of DAGs with the K best scores and then uses these DAGs to constitute the estimator 

One advantage of the iT-best algorithm is that its estimator also has 


p^{f\D) by Eq. (24i and \ 
the property specified as Theorem[^(iv) so that it can provide the sound interval for p^ {f\D) just as 
our IW-DDS. However, the iT-best algorithm has time complexity 0{n^~^^C{m)+ T'{K)n2'^~^) 
and space complexity 0{KnT^), where T'{K) is the time spent on the best-first search for K solu¬ 


tions and T'{K) has been shown to be 0{KlogK) by Elerova et al. (20121. Thus, the increase in K 


will dramatically increase the computation costs of the iT-best algorithm when n is not small. As a 
result, to obtain an interval width similar to the one from our IW-DDS, much more time and space 
costs are required for the iL-best. In our experiments using an ordinary desktop PC, the computation 
problem becomes severe for re > 19 since K can only take some small values (such as no more than 
40) before the iT-best algorithm exhausts the memory. Accordingly, A obtained from the AT-best 
is usually smaller than the one from our IW-DDS (so that the interval from the TL-best is usually 


wider) even if K is set to reach the memory limit of a computer. (Please refer to Section 4.2 for 
detailed discussion.) 


4. Experimental Results 

We have implemented our algorithms in a C-f-f language tool called “BNEearner” and run sev¬ 
eral experiments to demonstrate its capabilities. (BNEearner is available at http : / /www. cs . 
iastate . edu/ ~ jtian/ Sof tware/BNLearner/BNLearner . htm,) Our tested data sets 
include ten real data sets from the UCI Machine Eeaming Repository ( [Asuncion and Newman 
2007||: “Tic-Tac-Toe,” (which is also simply called “T-T-T,”) “Glass,” “Wine,” “Housing,” “Credit,” 


“Zoo,” “Eetter,” “Tumor,” “Vehicle,” and “German”. Our tested data sets also include three syn¬ 
thetic data sets: the first one is a synthetic data set “Synl5” generated from a gold-standard 15-node 
Bayesian network built by us; the second one is a synthetic data set ‘Tnsurl9” generated from a 
19-node subnetwork of “Insurance” Bayesian network (Binder et al. 1997| ); and the third one is a 
synthetic data set “Child” from a 20-node “Child” Bayesian network used by [Tsamardinos et al. 
(2006). All the data sets contain only discrete variables (or are discretized) and have no missing 


values (or have its missing values filled in). Eor the four data sets (“Synl5,” “Eetter,” ‘Tnsurl9” 
and “Child”), since a large number of data instances are available, we also vary m (the number of 
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Name 

n 

m 

PO-MCMC 
/i(SAD) 6-(SAD) 

DOS 

/i(SAD) d-(SAD) 

DDS 

/i(SAD) dCSAD) 

T-T-T (Tic-Tac-Toe) 

10 

958 

0.5174 

0.1280 

0.1350 

0.0257 

0.1547 

0.0378 

Glass 

11 

214 

0.0696 

0.0249 

0.0230 

0.0067 

0.0529 

0.0076 

Wine 

13 

178 

0.1616 

0.0403 

0.0505 

0.0138 

0.0839 

0.0137 

Housing 

14 

506 

0.3205 

0.1303 

0.0691 

0.0146 

0.1150 

0.0117 

Credit 

16 

690 

0.4549 

0.2495 

0.0581 

0.0221 

0.1071 

0.0165 

Zoo 

17 

101 

0.6079 

0.1809 

0.1020 

0.0130 

0.2756 

0.0137 

Tumor 

18 

339 

0.6059 

0.1849 

0.0877 

0.0226 

0.2050 

0.0180 

Vehicle 

19 

846 

6.9774 

8.7960 

0.0200 

0.0042 

0.0547 

0.0096 

German 

21 

1,000 

2.8802 

2.0191 

0.0994 

0.0295 

0.1298 

0.0338 

SynlS 

15 

100 

0.9024 

0.2258 

0.1246 

0.0142 

0.2622 

0.0190 



200 

0.6449 

0.1569 

0.0975 

0.0163 

0.2228 

0.0172 



500 

0.3424 

0.1214 

0.0651 

0.0153 

0.1116 

0.0126 



1,000 

0.1558 

0.0496 

0.0515 

0.0128 

0.0724 

0.0118 



2,000 

0.0465 

0.0209 

0.0359 

0.0075 

0.0473 

0.0071 



5,000 

0.0217 

0.0144 

0.0196 

0.0064 

0.0247 

0.0086 

Letter 

17 

100 

0.9530 

0.1285 

0.1559 

0.0210 

0.2948 

0.0229 



200 

0.3854 

0.0825 

0.0827 

0.0153 

0.1758 

0.0142 



500 

0.4369 

0.1529 

0.0755 

0.0157 

0.1326 

0.0107 



1,000 

0.3007 

0.1254 

0.0537 

0.0140 

0.0828 

0.0171 



2,000 

1.3740 

0.9177 

0.0895 

0.0238 

0.1386 

0.0288 



5,000 

0.0669 

0.0139 

0.0218 

0.0059 

0.0292 

0.0088 

Insurl9 

19 

100 

0.6150 

0.1995 

0.0947 

0.0179 

0.1575 

0.0213 



200 

0.4428 

0.1319 

0.0663 

0.0111 

0.1024 

0.0162 



500 

0.2757 

0.1127 

0.0436 

0.0140 

0.0594 

0.0099 



1,000 

0.4539 

0.3031 

0.0301 

0.0088 

0.0422 

0.0134 



2,000 

0.0100 

0.0073 

0.0073 

0.0042 

0.0079 

0.0029 



5,000 

0.0110 

0.0100 

0.0094 

0.0041 

0.0116 

0.0043 

Child 

20 

100 

0.4997 

0.1153 

0.0745 

0.0147 

0.1772 

0.0146 



200 

0.1896 

0.0528 

0.0425 

0.0081 

0.0982 

0.0101 



500 

0.2385 

0.0702 

0.0434 

0.0068 

0.0816 

0.0123 



1,000 

0.1079 

0.0525 

0.0307 

0.0093 

0.0406 

0.0080 



2,000 

0.0864 

0.0521 

0.0231 

0.0090 

0.0275 

0.0083 



5,000 

0.0938 

0.0539 

0.0235 

0.0078 

0.0246 

0.0066 


Table 1: Comparison of the PO-MCMC, the DOS & the DDS in Terms of SAD 


instances) to see the corresponding learning performance. (All the data cases are also included in 
the tool of BNLearner.) All the experiments in this section were run under Linux on one ordinary 
desktop PC with a 3.0GHz Intel Pentium processor and 2.0 GB memory if no extra specification is 
provided. In addition, the maximum in-degree k is assumed to be 5 for all the experiments. 


4.1 Experimental Results for the DDS 


In this subsection, we compare our DDS algorithm with the Partial Order MCMC method (Niini- 


maki et al. 20111, the state-of-the-art learning method under the order-modular prior. 


The Partial Order MCMC (PO-MCMC) method is implemented in BEANDisco, a C-i-i- lan¬ 
guage tool provided by Niinimaki et al. ( 2011| l. (BEANDisco is available at http : //www. cs . 
helsinki . f i/u/tzniinim/BEANDisco/ .) The current version of BEANDisco can only 
estimate the posterior of an edge feature, but as Niinimaki et al.| ( |2011[ ) have stated, the PO-MCMC 
readily enables estimating the posterior of any structural feature by further sampling DAGs consis¬ 
tent with an order. 


18 




























Structure Learning in BNs of Moderate Size by Efficient Sampling 


Since n (the number of the variables) in each investigated data case is moderate, we are able 


to use REBEL, a C++ language implementation of the DP algorithm of Koivisto (20061, to get the 


exact posterior of every edge under the assumption of the order-modular prior. (REBEL is avail¬ 
able at http : //www. cs . helsinki . f i/u/mkhkoivi/REBEL/ ,) Therefore we can use the 
criterion of the sum of the absolute differences (SAD) ( Eaton and Murph^ 20071 to measure the 
feature learning performance for each data case: 

SAD = ^|p(/|D)-p(/|77)|, 

/ 

where p{f\D) is the exact posterior of the investigated feature /, and p{f\D) is the corresponding 
estimator. In Section|4.1l SAD is essentially \p^{i j\D) — p^{i j\D)\, since the investi¬ 
gated feature is the edge feature i ^ j under the order-modular prior. A smaller SAD will indicate 
a better performance in structure discovery. Note that the criterion SAD is closely related to another 
criterion MAD (the mean of the absolute differences) since MAD = SAD/(n(n — 1)). Thus, for 
each data case the conclusion based on the comparison of SAD values is the same as the one based 
on the comparison of MAD values since n(n — 1) is just a constant for each data case. 


Eor fair comparison, in our algorithms we used the K2 score (Heckerman et al. 19951 and we 
set qi{Ui) = 1 and pi{Pai) = l/(|p~^|) for each i, Ui, Pat, where |Paj| is the size of the set Pai, 
since such a setting is used in both BEANDisco and REBEL. 

Eor the setting of the PO-MCMC, according to the suggestion for the optimal setting from 


Niinimaki et al. (20111, we set the bucket size b to be 10 for all the data cases except Tic-Tac-Toe. 
The bucket size b was set to be 9 for the data case Tic-Tac-Toe because Tic-Tac-Toe has only 10 
variables so that the setting 5=10 will cause the tool BEANDisco to throw a run-time error. We 
ran the first 10,000 iterations for “burn-in,” and then took 200 partial order samples at intervals 
of 50 iterations. Thus, there were 20, 000 iterations in total. (The time cost of each iteration in 
the PO-MCMC is 0(n^+^ -|- n^2^n/5).) In the PO-MCMC, for each sampled partial order Pi, 
p{f\D, Pi) is obtained by p{D, /, Pi)/p{D, Pi) = p{D, /, Pi)/p{D, f = 1, Pi), where p{D, f, Pi) 
— Yl~!,DPi Z]gc-< /(^) G)p{D\G). The notation Y-oPi means that all the total order ^’s 

that are linear extensions of the sampled partial order Pi will be included to obtain p{D, f, Pi). 
Eor example, for a data set with n = 20, since our bucket size 5 = 10, there are 20!/(10!l0!) = 
184, 756 total orders that will be included for each sampled partial order Pj. The inclusion of the 
information of a large number of total orders consistent with each sampled partial order gives great 
learning power to the PO-MCMC method; and such an inclusion can be efficiently computed by 


the algorithm of Parviainen and Koivisto (20101 with the assumptions of the order-modular prior 
and the maximum in-degree k. Einally, for the PO-MCMC, the estimated posterior of each edge is 
computed using p^{f\D) = (l/T) Yj=iPif\D, Pi)- 

Because the to-be-leamed feature is the edge feature, we can also use our DOS algorithm for 
the comparison. Eor both the DOS algorithm and the DDS algorithm, we set No = 20, 000, that 
is, 20,000 (total) orders were sampled. Theoretically, we expect that the learning performance of 
the DOS should be better than the performance of the DDS because the additional approximation 
coming from the DAG sampling step is avoided by the DOS. By listing the performance of the 
DOS, we mainly intend to examine how much the performance of the DDS decreases due to the 
additional approximation from sampling one DAG per order. However, since the DDS but not the 
DOS is capable of learning non-modular features, the comparison between the PO-MCMC method 
and the DDS method is our main task. 


19 

















He, Tian and Wu 


Name 

n 

m 

PO-MCMC 

KT) 

DOS 

KT) 

DDS 

A(T) 

A(Tdp) 

o{Tdp) 

A(^or(i) 

DDS 

^{Tord) 

P-Tdag) 

^(Tdag) 

T-T-T 

10 

958 

104.30 

1.96 

1.73 

1.42 

0.0159 

0.24 

0.0066 

0.06 

0.0017 

Glass 

11 

214 

222.02 

1.52 

1.25 

0.89 

0.0136 

0.25 

0.0076 

0.09 

0.0013 

Wine 

13 

178 

374.17 

2.56 

2.53 

1.63 

0.0121 

0.35 

0.0039 

0.53 

0.0024 

Housing 

14 

506 

510.65 

4.92 

4.54 

3.88 

0.0143 

0.40 

0.0033 

0.23 

0.0020 

Credit 

16 

690 

962.97 

30.90 

30.93 

29.57 

0.2118 

0.44 

0.0055 

0.90 

0.0122 

Zoo 

17 

101 

1,331.80 

13.75 

21.90 

12.05 

0.0837 

0.62 

0.0039 

9.20 

0.1156 

Tumor 

18 

339 

1,856.03 

44.99 

60.21 

43.33 

1.0763 

0.72 

0.0052 

16.12 

0.2941 

Vehicle 

19 

846 

2,683.91 

149.08 

149.23 

147.35 

1.2863 

0.65 

0.0079 

1.20 

0.0219 

German 

21 

1,000 

4,887.33 

333.43 

356.17 

330.64 

1.3304 

0.97 

0.0087 

24.52 

0.4441 

Synl5 

15 

100 

677.29 

4.82 

7.13 

3.49 

0.0824 

0.50 

0.0021 

3.12 

0.0337 



200 

677.47 

5.91 

8.91 

4.59 

0.0473 

0.47 

0.0044 

3.83 

0.0258 



500 

686.51 

8.56 

8.44 

7.41 

0.1911 

0.48 

0.0100 

0.53 

0.0050 



1,000 

716.31 

13.01 

12.52 

11.78 

0.0927 

0.48 

0.0115 

0.24 

0.0022 



2,000 

731.50 

21.70 

21.24 

20.58 

0.5310 

0.49 

0.0086 

0.15 

0.0016 



5,000 

731.05 

48.01 

47.26 

46.63 

0.4110 

0.47 

0.0031 

0.14 

0.0008 

Letter 

17 

100 

1,322.43 

16.35 

21.91 

14.64 

0.2160 

0.64 

0.0036 

6.60 

0.0497 



200 

1,315.01 

19.74 

20.46 

18.14 

0.0809 

0.55 

0.0026 

1.74 

0.0234 



500 

1,338.33 

27.97 

28.21 

26.35 

0.0489 

0.51 

0.0066 

1.32 

0.0130 



1,000 

1,343.88 

39.45 

39.03 

38.11 

0.3011 

0.47 

0.0051 

0.43 

0.0089 



2,000 

1,358.29 

61.75 

61.44 

60.52 

0.5109 

0.47 

0.0078 

0.42 

0.0063 



5,000 

1,610.37 

126.53 

126.49 

125.67 

0.7967 

0.52 

0.0058 

0.27 

0.0053 

Insurl9 

19 

100 

2,616.56 

53.39 

86.06 

51.06 

0.2520 

0.86 

0.0091 

34.11 

0.9630 



200 

2,633.44 

62.12 

77.44 

59.95 

0.3025 

0.84 

0.0083 

16.61 

0.2278 



500 

2,680.85 

80.70 

85.03 

77.89 

0.7618 

0.79 

0.0082 

6.32 

0.0535 



1,000 

2,734.10 

106.37 

109.39 

104.63 

1.0958 

0.89 

0.0122 

3.85 

0.0296 



2,000 

2,915.60 

155.05 

158.31 

154.26 

3.7703 

0.90 

0.0154 

3.13 

0.0155 



5,000 

3,445.84 

297.31 

300.51 

297.41 

4.7737 

0.96 

0.0090 

2.11 

0.0091 

Child 

20 

100 

3,710.49 

102.42 

181.38 

99.71 

0.3497 

1.07 

0.0109 

80.56 

1.1980 



200 

3,717.10 

112.68 

168.48 

110.05 

0.1793 

1.04 

0.0078 

57.36 

0.5335 



500 

3,757.76 

136.98 

193.11 

134.32 

0.4652 

1.07 

0.0111 

57.69 

0.7276 



1,000 

3,799.47 

174.18 

186.15 

171.54 

1.9832 

1.08 

0.0104 

13.50 

0.9244 



2,000 

4,018.03 

241.48 

254.26 

238.37 

2.4952 

1.15 

0.0214 

14.71 

0.4833 



5,000 

4,531.20 

443.54 

455.30 

441.64 

4.8785 

1.16 

0.0068 

12.47 

0.4113 


Table 2: Comparison of the PO-MCMC, the DOS & the DDS in Terms of Time (Time Is in Seconds) 


Table shows the experimental results in terms of SAD for each data case with n variables and 
m instances, while Table lists the running time costs corresponding to Table [T] For each of the 
three methods, we performed 15 independent runs for each data case. The sample mean and the 
sample standard deviation of the 15 SAD values of each method, denoted by /l(SAD) and (t(SAD) 
respectively, are listed along each method in Table [T] Correspondingly, the sample mean of the total 
running time Tt of each method, denoted by is shown in Table (Precisely speaking, the 

reported total running time Tt of the DDS method includes both the time of running the three steps 
of the DDS and the relatively tiny 0{No) time cost of computing p^{f\D) for each edge / using 
Eq. Q at the end. Similarly, the reported total running time Tt of the DOS method also includes the 
relatively tiny 0{No) time cost of computing p^{f\D) for each edge / using Eq. ( [Tt] ) and Eq. ( [Ts] ) 
at the end.) In addition, the sample mean and the sample standard deviation of the running time of 
three steps of the DDS (including the DP step, the order sampling step and the DAG sampling step), 
denoted by piTop), o{Tdp), fi{Tord), ^{Tord), K^dag) and a{TDAG) respectively, are listed in 
the last six columns in TableNote that we still show jl{Tpip), the mean running time of the DP 
step of the DDS in the 15 independent runs, though the DP step is not a random algorithm at all. 
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The running time of the DP step is not exactly the same in each run due to the randomness from 
uncontrolled factors such as the internal status of the computer. By showing //(Tdp), we can clearly 
see the percentage of the total running time that the DP step typically takes by comparing fi{TDp) 
and 


Tables [T] and clearly illustrate the performance advantage of our DDS method over the PO- 
MCMC method. The overall time costs of our DDS based on 20,000 DAG samples are much 
smaller than the corresponding costs of the PO-MCMC method based on 20, 000 MCMC iterations 
in the partial order space. Using much shorter time, our DDS method has its /i(S AD) much smaller 
than /i(S AD) from the PO-MCMC method for 28 out of all the 33 data cases. The five exceptional 
cases are Glass, Synl5 with m = 2, 000, Synl5 with m = 5, 000, Insurl9 with m = 2, 000, and 
Insurl9 with m = 5, 000. (In both Glass and Insurl9 with m = 2, 000, /l(SAD) using our DDS 
method is still smaller than the one using the PO-MCMC method but their difference is not very 
large relatively to (t(SAD) from the PO-MCMC method.) Furthermore, since both /t(SAD) and 
(t(SAD) are given, by the two-sample t test with unequal variances ( Casella and BergeT} |2002| , we 
can conclude with strong evidence (at the significance level 5 x 10“^) fhat the real mean of SAD 
using our DDS method is smaller than the real mean of SAD using the PO-MCMC method for each 
of the 28 data cases. For the exceptional data case Glass, the p-value of the t test is 0.0120 so that 
we can conclude at the significance level 0.05 fhat the real mean of SAD using our DDS method 
is smaller than the real mean of SAD using the PO-MCMC method. For each of the other four 
exceptions, by the same t test we accept (with the p-value > 0.2) the null hypothesis that there is 
no significant difference in the real means of SAD from two methods. Thus, the advantage of our 
DDS algorithm over the PO-MCMC method in learning Bayesian networks of a moderate n can be 
clearly seen, though the value of the PO-MCMC method still remains for larger n where our DDS 
algorithm is infeasible. 


In terms of the total running time of the DDS algorithm. Table shows that the running time 
of the DP step always accounts for the largest portion. The running time of the DAG sampling step 
is less than 81 seconds to get 20, 000 DAG samples for all the 33 cases. Though both the order 
sampling step and the DAG sampling step involve randomness, the variability of their running time 
is actually small. This can be seen from the ratio of a{Tord) to fJ-{Tord) (which is always less than 
3.04% for all the 33 cases) and the ratio of ^{Tpag) to fj-iTpAc) (which is always less than 6.85% 
for all the 33 cases). The ratio of fiiTpAc) to fi{Tord) ranges from 0.25 to 75.29 across the 33 cases, 
which is much smaller than the upper bound of the ratio of 0{n^~^^No) to 0{'n?No). This indicates 
that our time-saving strategy introduced in Remarkj^can effectively reduce the running time of the 
DAG sampling step. In addition, the running time of the DAG sampling step often decreases further 
when m increases, which can be clearly seen from all the four data sets (Synl5, Letter, Insurl9 and 
Child) with different values of m. Take the data set Letter for example, when m increases from 100 
to 1,000, the corresponding {^{Tdag) decreases from 6.60 to 0.43 second, a 93.48% of decrease. 
In summary, the effectiveness of our time-saving strategy introduced in Remark has been clearly 
shown in Table |2] 


Finally, we present the experimental results for the DDS by varying the sample size. We first 
choose the data case Letter with m = 500 as an example. For the DDS, we tried sample size No 
= 5,000 • i, where i G {1,2,..., 10}. For each i, we independently ran the DDS 15 times to get 
the sample mean and the sample standard deviation of SAD for the (directed) edge features. For the 
PO-MCMC, with the bucket size b = 10, we ran totally 5, 000 • i MCMC iterations in the partial 
order space, where f G {1, 2,..., 10}. For each i, we discarded the first 2, 500 • i MCMC iterations 
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Letter (m = 500) 



Figure 1: Plot of the SAD Performance of 
the PO-MCMC and the DDS for Letter (m = 
500) 


Letter (m = 500) 

40001-^^-1-^^- 



H“ 29 - ^ , - - " ' 

i28^ 

27 I-1-:-:-1-1-:-:-1- 

123456789 10 

Figure 2: Plot of the Total Running Time of 
the PO-MCMC and the DDS for Letter (m = 
500) 


for “bum-in” and set the thinning parameter to be 50 so that 50 • i partial orders finally got sampled. 
Again, for each i, we independently ran the PO-MCMC 15 times to get the sample mean and the 
sample standard deviation of SAD for the edge features. 

Figure [T] shows the SAD performance of the two methods with each i in terms of the edge 
features, where an error bar represents one sample standard deviation (t(SAD) across 15 runs from 
a method (the PO-MCMC or the DDS) at each i. For example. Figure shows that when i = A, 
/i(SAD) = 0.1326 and d(SAD) = 0.0107 from our DDS algorithm; while /i(SAD) = 0.4369 and 
(t(SAD) = 0.1529 from the PO-MCMC method. This exactly matches the results previously shown 
in Table [T] Correspondingly, Figureshows (the sample mean of the total running time) of 
the PO-MCMC and the DDS with each i, where the running time is in seconds. The advantage 
of our DDS can be clearly seen by combining Figures and For each i G {1, 2,..., 10}, the 
real mean of SAD from the DDS is significantly smaller than the one from the PO-MCMC with the 
p-value < 5 X 10“^ returned by the two-sample t test (with unequal variances). In terms of the 
running time, the total running time of the DDS is very short relative to the one of the PO-MCMC. 
For example, of the DDS increases with respect to i and reaches only 29.55 seconds at z = 10. 
This is shorter than 9% of fL{Tt) of the PO-MCMC at z = 1, which is 336.09 seconds. Therefore, 
the learning performance of the DDS with each sample size is significantly better than the one of 
the PO-MCMC for the data case Letter with m = 500. 

We also performed the experiment with the same experimental settings for the data cases Tic- 
Tac-Toe, Wine, Child with m = 500, and German. Please refer to the supplementary material for the 
experimental results. (The supplementary material is available at http : / /www. cs . iastate . 
edu/~jtian/Software/BNLearner/BN_Learning_Sampling_Supplement.pdf, ) 
The same conclusion about the learning performance can be clearly drawn by examining the figures 
shown in fhe supplemenfary maferial. 


4.2 Experimental Results for the IW-DDS 


In this subsection, we compare our IW-DDS algorithm with the Hybrid MCMC (i.e., DP-i-MCMC) 
method ( Eaton and Murph)^ 20071 and the iT-best algorithm ( Tian et ahj 20101, two state-of-the-art 
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Name 

n 

m 

DP 

SAD 

DP+MCMC 
/i{SAD) (3-(SAD) 

A'-best 

SAD 

IW-DDS 

/i(SAD) d-(SAD) 

<! 

IW-DDS 

/(A) (t(A) 

T-T-T 

10 

958 

0.1651 

15.0079 

2.9877 

1.4194 

0.0227 

0.0102 

6.943E-01 

9.935E-01 

8.636E-04 

Glass 

11 

214 

1.5444 

0.3587 

0.4599 

0.0904 

0.0381 

0.0019 

9.780E-01 

9.901E-01 

6.368E-04 

Wine 

13 

178 

1.4786 

0.4605 

0.2968 

0.2011 

0.1041 

0.0075 

9.023E-01 

9.670E-01 

1.700E-03 

Housing 

14 

506 

5.6478 

8.0000 

3.3408 

9.1179 

4.8276 

0.0624 

2.880E-02 

1.096E-01 

1.200E-03 

Credit 

16 

690 

4.0580 

5.0261 

2.3482 

5.1492 

2.9148 

0.0336 

5.010E-02 

1.793E-01 

1.700E-03 

Zoo 

17 

101 

8.2142 

32.4189 

10.0953 

35.1215 

19.7025 

4.0767 

3.815E-08 

1.652E-11 

1.211E-11 

Tumor 

18 

339 

5.1536 

17.5104 

7.9198 

20.5793 

10.3139 

0.6950 

2.940E-05 

3.619E-06 

3.586E-07 

Vehicle 

19 

846 

3.5759 

3.8234 

0.4011 

3.3683 

0.4648 

0.0194 

5.387E-01 

9.432E-01 

2.600E-03 

German 

21 

1,000 

3.7261 

5.0207 

3.2223 

5.5902 

0.9891 

0.0449 

7.800E-02 

7.422E-01 

7.200E-03 

Synl5 

15 

100 

4.9321 

12.8705 

7.6384 

11.8685 

10.1341 

0.1495 

1.604E-04 

1.183E-04 

4.664E-06 



200 

3.2557 

4.5090 

2.5875 

7.5232 

4.9079 

0.0583 

2.700E-03 

7.300E-03 

1.265E-04 



500 

6.9798 

5.5466 

1.9175 

4.4379 

4.2965 

0.1619 

1.526E-01 

2.388E-01 

4.900E-03 



1,000 

1.3000 

0.3974 

0.3299 

0.0848 

0.0498 

0.0048 

9.699E-01 

9.843E-01 

8.909E-04 



2,000 

1.7192 

1.8263 

1.7095 

0.3701 

0.1081 

0.0147 

8.521E-01 

9.703E-01 

3.300E-03 



5,000 

1.9473 

0.0304 

0.0094 

8.89E-04 

0.0022 

0.0002 

9.998E-01 

9.994E-01 

9.821E-05 

Letter 

17 

100 

9.2140 

27.1507 

4.0940 

24.4313 

15.8780 

0.2764 

2.908E-04 

1.621E-04 

5.336E-06 



200 

7.2855 

15.1587 

3.5615 

9.4512 

6.7936 

0.1191 

7.800E-03 

1.220E-02 

3.341E-04 



500 

6.0961 

3.4637 

4.6789 

1.7237 

0.6347 

0.0119 

6.948E-01 

8.808E-01 

3.000E-03 



1,000 

0.6394 

0.1761 

0.0166 

0.0837 

0.0766 

0.0039 

9.834E-01 

9.864E-01 

8.678E-04 



2,000 

2.3913 

3.5085 

3.1132 

2.0976 

0.1338 

0.0213 

6.859E-01 

9.756E-01 

2.700E-03 



5,000 

0.8407 

0.1182 

0.0442 

0.0160 

0.0072 

0.0005 

9.948E-01 

9.972E-01 

2.077E-04 

lnsurl9 

19 

100 

5.3356 

9.4318 

3.9576 

11.7779 

6.5062 

0.0891 

6.000E-03 

2.010E-02 

4.767E-04 



200 

5.9844 

5.5465 

2.7295 

2.2572 

1.4630 

0.0557 

4.495E-01 

7.049E-01 

1.120E-02 



500 

1.8274 

0.3605 

0.2287 

0.4970 

0.1328 

0.0105 

7.464E-01 

9.379E-01 

3.000E-03 



1,000 

1.7386 

0.2186 

0.0762 

0.7498 

0.0623 

0.0111 

5.866E-01 

9.785E-01 

2.600E-03 



2,000 

1.2737 

0.1217 

0.0380 

0.0174 

0.0062 

0.0012 

9.900E-01 

9.976E-01 

4.864E-04 



5,000 

1.9511 

0.1765 

0.0594 

0.0103 

0.0092 

0.0010 

9.970E-01 

9.973E-01 

2.086E-04 

Child 

20 

100 

6.9783 

11.8987 

3.1086 

11.6189 

7.0304 

0.0909 

7.848E-04 

2.700E-03 

1.066E-04 



200 

3.2826 

4.7066 

4.3749 

5.0729 

2.8510 

0.0192 

4.800E-03 

1.506E-01 

1.200E-03 



500 

2.5580 

2.4716 

1.3489 

1.5304 

0.5416 

0.0305 

3.582E-01 

8.222E-01 

5.100E-03 



1,000 

2.4708 

2.6061 

2.2909 

0.7066 

0.1499 

0.0198 

7.013E-01 

9.545E-01 

3.400E-03 



2,000 

2.3330 

1.4286 

1.2290 

1.5279 

0.0662 

0.0161 

6.509E-01 

9.828E-01 

2.800E-03 



5,000 

2.0365 

1.2533 

1.7313 

0.8783 

0.0150 

0.0013 

8.209E-01 

9.940E-01 

4.696E-04 


Table 3: Comparison of the DP+MCMC, the iT-best & the IW-DDS in Terms of SAD 


methods that can estimate the posteriors of any features without the order-modular prior assumption. 
The implementation of the Hybrid MCMC (called BDAGL) and the implementation of the AT-best 
(called KBest) are both made available online by their corresponding authors. (BDAGL is avail¬ 
able at http : //www. cs . ubc . ca/ ~murphyk/Software/BDAGL/ ,) (KBest is available at 
http://www.cs.iastate.edu/'jtian/Software/UAI-10/KBest.htm,) 

Since n in each investigated data case is moderate, we are able to use POSTER, a C-f-f language 
implementation of the DP algorithm of Tian and He (20091 to get (z —F j | i9), the exact posterior 
of each edge i ^ j under the structure-modular prior. (POSTER is available at http : / /www. 
cs.iastate.edu/~jtian/ Software/UAI-0 9/Poster . htm.) Therefore we can use the 
SAD criterion (Ylij |P/(* j\^) ^ measure the performance of these three 

methods in the structural learning for each data case. 

Eor fair comparison, in our algorithm we used the BDeu score ( jHeckerman et al. 19951 with 
equivalent sample size 1 and set pi{Pai){= pi{Pai)) = 1, since these settings are also used in 
POSTER and the implementation of the AT-best algorithm. 

As for the DP-fMCMC, we note that most part of its implementation in BDAGE tool is written 
in Matlab, whereas both the K-best and the IW-DDS are implemented in C-f-f. In order to make 
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Name 

n 

m 

DP+MCMC 

A^-best 

Tt 

IW-DDS 

KTi) 

pTop) 

o(Tdp) 

IW-DDS 

pTord) o{Tord) 

K^dag) 

<i(TDAG) 

T-T-T 

10 

958 

1.29+ 1,032.40 

8.37 

3.28 

1.27 

0.0060 

1.57 

0.0109 

0.44 

0.0128 

Glass 

11 

214 

1.04+ 1,037.26 

18.13 

1.72 

0.98 

0.0210 

0.50 

0.0054 

0.24 

0.0018 

Wine 

13 

178 

2.44+ 1,127.80 

141.20 

3.52 

2.15 

0.0199 

0.63 

0.0061 

0.74 

0.0057 

Housing 

14 

506 

4.83 + 1,421.00 

323.37 

5.67 

4.21 

0.0813 

0.63 

0.0042 

0.52 

0.0043 

Credit 

16 

690 

33.73 + 1,476.90 

2,073.41 

35.20 

29.30 

0.3322 

0.81 

0.0055 

4.94 

1.6191 

Zoo 

17 

101 

22.33 + 2,107.50 

5,531.81 

26.16 

12.49 

0.1853 

1.11 

0.0048 

12.05 

0.1898 

Tumor 

18 

339 

60.86+ 1,799.99 

18,640.09 

87.35 

39.49 

0.4419 

1.19 

0.0184 

46.31 

0.2937 

Vehicle 

19 

846 

207.27 + 1,886.10 

17,126.45 

171.75 

160.55 

0.9310 

1.49 

0.0175 

9.67 

0.0665 

German 

21 

1,000 

600.19+ 1,849.90 

10,981.29 

540.61 

392.25 

2.8656 

1.68 

0.0207 

146.65 

1.4290 

Synl5 

15 

100 

5.21 + 1,284.00 

889.97 

9.41 

3.56 

0.0667 

0.81 

0.0048 

4.80 

0.0273 



200 

6.28 + 1,286.20 

901.23 

10.29 

4.76 

0.2090 

0.80 

0.0111 

4.53 

0.0256 



500 

9.64+ 1,336.80 

899.42 

9.59 

7.50 

0.0821 

0.85 

0.0051 

1.08 

0.0128 



1,000 

13.69+ 1,364.60 

910.76 

13.35 

12.15 

0.7578 

0.85 

0.0156 

0.33 

0.0038 



2,000 

22.64+ 1,372.10 

907.02 

21.98 

20.81 

0.5200 

0.85 

0.0061 

0.30 

0.0019 



5,000 

54.79+ 1,356.70 

932.96 

52.07 

47.80 

1.0887 

3.99 

0.0235 

0.28 

0.0023 

Letter 

17 

100 

25.53 + 1,572.60 

7,639.02 

20.27 

15.83 

0.0601 

1.15 

0.0062 

2.83 

0.0292 



200 

30.01 + 1,576.80 

7,966.25 

25.87 

20.22 

0.1769 

1.09 

0.0069 

4.21 

0.0304 



500 

39.58 + 1,598.90 

8,257.60 

31.88 

29.84 

0.1575 

0.95 

0.0055 

1.01 

0.0123 



1,000 

52.85 + 1,575.60 

8,380.57 

44.48 

42.93 

0.4835 

0.98 

0.0093 

0.56 

0.0100 



2,000 

77.48 + 1,591.00 

7,619.29 

69.14 

66.34 

0.5184 

2.01 

0.0241 

0.78 

0.0068 



5,000 

157.69+ 1,636.40 

8,134.85 

136.95 

134.94 

I.6I27 

1.36 

0.0097 

0.65 

0.0067 

Insurl9 

19 

100 

101.47 + 1,828.00 

6,745.41 

112.28 

55.85 

0.1540 

1.41 

0.0117 

54.71 

0.5438 



200 

113.79+ 1,896.10 

6,783.76 

82.52 

68.12 

0.4187 

1.45 

0.0120 

12.90 

0.1329 



500 

137.18 + 1,864.40 

6,894.15 

98.96 

91.44 

0.4547 

1.43 

0.0128 

6.07 

0.0358 



1,000 

168.55 + 1,862.30 

6,966.90 

133.87 

123.42 

0.8007 

1.44 

0.0157 

9.00 

0.0601 



2,000 

226.36+ 1,781.70 

7,277.60 

185.02 

177.66 

1.3165 

3.30 

0.0483 

4.06 

0.0356 



5,000 

380.78 + 1,814.80 

7,061.76 

336.03 

329.78 

4.5841 

1.89 

0.0220 

4.34 

0.0713 

Child 

20 

100 

203.44+ 1,785.10 

15,085.86 

223.62 

106.01 

0.3976 

1.67 

0.0176 

115.60 

1.3183 



200 

215.70+ 1,760.80 

14,222.86 

225.76 

119.82 

1.8604 

1.69 

0.0107 

104.09 

0.9659 



500 

248.17 + 1,818.70 

14,016.94 

214.91 

149.73 

0.8344 

1.67 

0.0183 

63.48 

0.5783 



1,000 

292.40+ 1,817.20 

15,504.82 

232.11 

193.18 

1.1041 

1.72 

0.0135 

37.20 

0.3176 



2,000 

371.12+ 1,841.40 

16,109.91 

306.42 

268.78 

2.3209 

1.82 

0.0165 

35.81 

0.3680 



5,000 

589.99+ 1,846.40 

15,372.61 

524.63 

483.90 

6.7403 

1.93 

0.0160 

38.80 

0.5411 


Table 4: Comparison of the DP+MCMC, the iC-best & the IW-DDS in Terms of Time (Time Is in 
Seconds) 


relatively fair comparison in terms of the running time, we used REBEL tool, a C++ implemen¬ 
tation of the DP algorithm of Koivisto (20061, to perform the computation in the DP phase of the 
DP+MCMC; but for fair comparison we changed its scoring criterion into the BDeu score with 
equivalent sample size 1 and set qi{Ui) = 1 and pi{Pai) = 1. To perform the computation in the 
MCMC phase, we used the Matlab implementation of BDAGL[^and we ran it under Windows 7 
on an ordinary laptop with 2.40 Intel Core i5 CPU and 4.0 GB memory. The MCMC used the pure 
global proposal (with local proposal choice /3 = 0) since such a setting was reported by [Eaton and 


Murphy (20071 to have the best performance for edge discovery when up to about 190,000 MCMC 


1. The original BDAGL was found to get an out-of-memory error for any data case with more than 19 variables in our 
experiments. This is because the original BDAGL intends to pre-compute the local scores of all the possible 

families and store them in an array for the later usage in both the DP phase and the MCMC phase. To solve this out-of- 
memory issue, we have updated the original Matlab code in BDAGL and provided the BDAGL-New package which is 
also available at http://www.cs.iastate.edu/~jtian/Software/BNLearner/BNLearner.htm 
The main update is that, with the assumed maximum in-degree, only the local scores of all the families whose sizes 
are no more than the assumed maximum in-degree are pre-computed and stored in a hash table. With BDAGL-New, 
the experiments for all the data cases in this paper can be performed without any error. 
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iterations were performed in their experimental results. We ran totally 190,000 MCMC iterations 
each time and discarded the first 100,000 iterations as the bum-in period. Then we set the thinning 
parameter to be 3 to get the final 30,000 DAG samples. As a result, the time statistics of the DP 
phase (the number before + sign in Table but not the MCMC phase (the number after + sign) can 
be directly compared with the ones of the other two methods. For each data case, we performed 20 
independent MCMC runs based on the DP outcome from REBEL to get the results. 

Eor our IW-DDS, we set No = 30, 000. We performed 20 independent runs for each data case 
to get the results. Eor the iT-best, note that its SAD is fixed since there is no randomness in the 
computed results. So we only ran it once to get the result. We set K to be 100 for Tic-Tac-Toe, 
Glass, Wine, Housing, Credit, Zoo, Synl5, and Letter, that is, we got the 100 best DAGs from 
Tic-Tac-Toe, Glass, Wine, Housing, Credit, Zoo, each of the six cases of Synl5, and each of the six 
cases of Letter. We set K to be only 80 for Tumor because our experiments showed that for Tumor 
the iT-best program ran out of memory with K > 80. Due to the same out-of-memory issue, we set 
K to be only 40 for Vehicle and Insurl9; and we set K to be only 20 for Child and 9 for German. 
The fact that K can only take a value no greater than 40 forn > 19 in our experiments confirms our 
claim about the computation problem of the AT-best algorithm in terms of its space cost. 

Table 1^ shows the experimental results in terms of SAD for each data case while Table shows 
the running time costs corresponding to Tablej^ (Just as Table|^ Tablej^also lists the sample mean 
and the sample standard deviation of the running time of three steps of the DDS in the IW-DDS 
algorithm.) The column named DP in Tablej^shows the SAD (Ylij \p^{i j\ D) -p^{i ^ j\D)\) 
where each edge posterior p^{i ^ j\D) is computed by the exact DP method of Tian and He (|2009 1, 
and each edge posterior (i —)• y | Zl) is computed by the exact DP method of Koivisto (20061. The 
SAD values reported in this column indicate the bias due to the assumption of the order-modular 
prior. Next to the DP column, the SAD values of the three methods are listed in Table Both the 
DP-fMCMC method and the IW-DDS method are random so that both /)(SAD) and d'(SAD) are 
shown for these two methods. The outcome of the AT-best algorithm is not random so that only its 
SAD is shown. Einally Tablej^also shows the cumulative posterior probability mass A for both the 
A'-best algorithm and the IW-DDS method. 

Tables 1^ and 1^ clearly demonstrate the advantage of our method over the other two methods. 
By using much shorter computation time, our method has its /i(SAD) less than the corresponding 
one from the DP-fMCMC for 32 out of the 33 data cases. The only exceptional case is Synl5 with 
m = 200. Eurthermore, based on the two-sample t test with unequal variances, we can conclude at 
the significance level 0.05 that the real mean of SAD using our method is less than the corresponding 
one using the DP-fMCMC for each of the 31 cases; the two exceptional cases are Synl5 with 
m = 100 and Synl5 with m = 200. (Eor 30 out of these 31 cases, the p-value of the two-sample t 
test is less than 0.01.) Meanwhile, d'(SAD) using our method is always much smaller than the one 
using the DP-fMCMC for each of the 33 cases, which indicates higher stability of the performance 
of our method. Similarly, using much shorter computation time, our method has its /i(SAD) less 
than the SAD from the AT-best for 32 out of the 33 cases. The only exception is Synl5 with m = 
5,000. Eurthermore, based on the one-sample t test ( Casella and Berger| [2002 1, we can conclude at 
the significance level 5 x 10“^ fhat the real mean of SAD using our method is less than the SAD 
using the A'-best for each of these 32 cases. 

There are several other interesting things shown in Tables and In terms of the SAD, for 
very small m, /t(SAD) using the DP-fMCMC method is even larger than the SAD from the DP 
phase (Koivisto| 2006) itself. Eor example, for the data case Zoo, the SAD from the DP phase is 
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8.2142 but /i(SAD) obtained after the MCMC phase of the DP+MCMC method is 32.4189. Similar 
situations also occur in Synl5, Letter, Insurl9, and Child when m = 100. This indicates that 
for very small m, the MCMC phase of the DP+MCMC method is unable to reduce the bias from 
the DP method of Koivisto (20061 for all these cases based on 190,000 MCMC iterations. As for 
the running time, please note that fi{T£)p) of our IW-DDS is always less than the running time 
of the DP phase of the DP+MCMC method. This is because the DP step of our method uses the 
DP algorithm of Koivisto and Sood] (2004i, that is, the first three steps of the DP algorithm of 
Koivisto ( |2006 1; while the DP phase of the DP+MCMC method uses all the five steps of the DP 


algorithm of Koivisto (20061. In other words, compared with the DP algorithm of Koivisto and 


|Sood| ( [2004] ), the DP algorithm of [Koivisto ( 2006| l includes a larger constant factor hidden in the 
+ kn2^) notation though these two DP algorithms have the same time complexity. 
This difference will make the total running time of our IW-DDS even less than the running time of 
the DP phase of the DP+MCMC method when the remaining steps of the IW-DDS run faster than 
the last two steps of the DP algorithm of Koivisto (20061. For example, for the data case Child with 
m = 5,000, p.{Tt) of the IW-DDS is 524.63 seconds while the corresponding running time of the 
DP phase of the DP+MCMC method is 589.99 seconds. Actually, Table [^ shows that there are 21 
out of the 33 cases where of the IW-DDS is less than the running time of the DP phase of the 
DP+MCMC method. In addition, just as shown in Section |4~T| the effectiveness of our time-saving 
strategy can also be clearly seen from Table|^ For example, the ratio of fiiTpAc) to f^iTord) ranges 
from 0.07 to 87.29 across the 33 cases, which is much smaller than the upper bound of the ratio of 


0{n^~^^No) to 0{'n?No). 

Table also shows the resulting cumulative probability mass A from the iT-best and our IW- 
DDS for all the data cases. (Note that A is computed by the formula A = P/ /P/ (^)’ 

where (D) is computed using POSTER tool.) In the table, /1(A) from our IW-DDS is greater than 
A from the AT-best for 28 out of the 33 data cases. The five exceptional cases are Zoo, Tumor, Synl5 
with m = 100, Synl5 with m = 5, 000, and Letter with m = 100. Interestingly, for four out of 
the five excepfional cases (as well as fhe ofher 28 cases), /l(SAD) from our IW-DDS is significanfly 
smaller fhan SAD from fhe iC-besf. One possible reason is fhaf fhe K besf DAGs fend fo have fhe 
same or similar local sfrucfures (family {i,Pai)’s) fhaf have relatively large local scores while a 
large number of DAGs sampled from our IW-DDS include various local sfrucfures for each node 
i. When A is far below 1, fhe inclusion of various local sfrucfures seems fo be more effecfive in 
improving fhe sfrucfural learning performance. 


In addition. Table shows fhaf when m is nof very small (such as no smaller fhan 1,000), A 
from our IW-DDS wifh No = 30,000 can reach a large percenfage (such as greafer fhan 90%) in 
mosf of our dafa cases. As a resulf, we can obfain a sound interval for {f\D) wifh a small widfh 
(such as less fhan 0.1) for any fealure /. 

To furfher demonsfrafe fhaf our IW-DDS can obfain a large A efficienfly when m is nof very 
small, we increased No from 100,000 to 600,000 wifh each incremenf 100,000 fo see ifs perfor¬ 
mance for fhe dafa cases Letter wifh m = 500, Child wifh m = 500, and German. Again, we 
performed 20 independenf runs for each dafa case fo gef fhe resulfs. Ligures and [^ show fhe 
increase in A wifh respecf fo fhe increase in No for fhese fhree dafa cases. Correspondingly, Ligures 
and [^indicate fhe increase in jl{Tord) and ^{Tdag) wifh respecf fo fhe increase in No 

for fhese fhree dafa cases, where fhe running fime is in seconds. Combining fhese figures, if is clear 
fhaf our IW-DDS can efficienfly achieve a large A. Take fhe dafa case German for example, wifh 
fhe fime cosf = 1,493.02 seconds, our IW-DDS can collecf No = 600, 000 DAG samples so 
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that the corresponding mean of A can reach 91.74%. Therefore, for any feature / in the data case 
German, our IW-DDS can provide a sound interval for p^{f\D) with width 0.0826. Note that the 
iT-best can only provide a meaningless sound interval for p^{f\D) with huge width 0.922 because 
its A can only reach 0.078 in the data case German before running out of the memory. Also note that 
the ratio of ^{Tuag) to il{Tord) decreases from 56.45 to 30.95 when No increases from 100,000 
to 600, 000. (The increase rate of jl{Tord) is a constant with respect to No', but the increase rate of 
p{T£,ag) actually decreases as No increases.) This witnesses the statement in Remark|^that the 
benefit from our time-saving strategy will typically increase when No increases. 

Finally, we present the experimental results for the IW-DDS by varying the sample size. Just 
as Section |4.1[ the experiments were performed for the five dafa cases Tic-Tac-Toe, Wine, Leffer 
wifh m = 500, Child wifh m = 500, and German. For fhe IW-DDS, we fried fhe sample size No 
= 5,000 • i, where i G {1, 2,..., 10}. For each i, we independenfly ran fhe IW-DDS 20 times fo gef 
fhe sample mean and fhe sample sfandard deviafion of SAD for fhe (direcfed) edge feafures. For fhe 
DP-fMCMC, we ran fofally 50, 000 • i MCMC iferafions, where z G {1, 2,..., 10}. For each i, we 
discarded fhe firsf 25,000 • i MCMC iferafions for “burn-in” and sef fhe fhinning paramefer fo be 5 
so fhaf 5, 000 • i DAGs gof sampled. Again, for each i, we independenfly ran fhe MCMC 20 times 
to get the sample mean and the sample standard deviation of SAD for the edge features. As for the 
iT-best, different experimental settings were used for different data cases due to the out-of-memory 
issue. For two data cases Tic-Tac-Toe and Wine, we ran the iT-best program with K = 20 • z, 
where z G {1,2,..., 10}. (The setting of AT = 20 • z guarantees that for these two data cases the 
running time of the iT-best is longer than the running time of the IW-DDS at each z, which will be 
demonstrated soon.) For the data case Letter with m = 500, we only ran the iF-best with K = 162 
because the AT-best program will run out of memory when K > 162 due to its expensive space cost. 
The corresponding result of the AT-best would be compared with the result of the IW-DDS at z = 10 
(i.e., the IW-DDS with No = 50,000). For the same out-of-memory issue, we only set 77 = 20 for 
Child with m = 500 and set 77 = 9 for German when running the 77-best program. Note that since 
there is no randomness in the outcome of the 77-best algorithm, we always ran the 77-best program 
only once to get its fixed SAD for fhe edge feafures. 

The experimenfal resulfs of comparing fhe fhree mefhods based on fhe dafa case Tic-Tac-Toe are 
shown in Figures and Figure shows fhe SAD performance of fhe fhree mefhods wifh each 
z G {1, 2,..., 10} in ferms of fhe edge feafures, where an error bar represenfs one sample sfandard 
deviafion (t(SAD) across 20 runs from fhe DP-fMCMC or fhe IW-DDS af each z. Figure [T0| shows 
(fhe sample mean of fhe fofal running fime) of fhe DP-fMCMC and fhe IW-DDS as well as 
Tt (fhe fofal running fime) of fhe 77-besf wifh each z, where fhe running fime is in seconds. The 
advanfage of our IW-DDS can be clearly seen by combining Figures]^ and [T^ Comparing wifh fhe 
DP-fMCMC, for each z G {1, 2,..., 10}, fhe IW-DDS uses fhe shorter running fime buf has ifs real 
mean of SAD significanfly smaller fhan fhe corresponding real mean of SAD from fhe DP-fMCMC, 
wifh fhe p-value < 1 x 10“^*^ from fhe fwo-sample t fesf wifh unequal variances. Comparing wifh 
fhe 77-besf, for each z G {1, 2,..., 10}, fhe IW-DDS uses fhe shorter running fime fhan fhe 77-besf, 
buf fhe IW-DDS has ifs real mean of SAD significanfly smaller fhan fhe SAD from fhe 77-besf, wifh 
fhe p-value < 1 x 10“^® from fhe one-sample t lest. Therefore, the learning performance of the 
IW-DDS is significantly better than the performance of the other two methods at each z for the data 
case Tic-Tac-Toe. 


The experimental results based on the data case Letter with m = 500 are shown in Figures 
and 12 Just as the description for Figures and [TOl Figure [TT] shows the SAD performance of the 
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Letter (m = 500) Letter (m = 500) 




Figure 3: Plot of A versus No for Letter (m = Figure 4: Plot of the Running Time versus No 
500) for Letter (m = 500) 


Child (m = 500) 


Child (m = 500) 




Figure 5: Plot of A versus No for Child (m = Figure 6: Plot of the Running Time versus No 
500) for Child (m = 500) 


German 



Figure 7: Plot of A versus No for German 


German 



Figure 8: Plot of the Running Time versus No 
for German 
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Figure 9: Plot of the SAD Performance of the 
DP-fMCMC, the AT-best and the IW-DDS for 
Tic-Tac-Toe 


Figure 10: Plot of the Total Running Time of 
the DP-fMCMC, the AT-best and the IW-DDS 
for Tic-Tac-Toe 


three methods in terms of the edge features, and Figure shows the corresponding time costs of 
the three methods. The only difference is that in both Figure [TT] and Figure [T^ the corresponding 
result of the AT-best with K = 162 is marked as a star and compared with the one of the IW-DDS 
at i = 10. The advantage of our IW-DDS can be clearly seen by combining Figures and 
Comparing with the DP-fMCMC, for each i G {1, 2,..., 10}, the IW-DDS uses the much shorter 
running time but has its real mean of SAD significantly smaller than the corresponding real mean 
of SAD from the DP-fMCMC, with the p-value < 0.013 from the two-sample t test with unequal 
variances. of the IW-DDS at f = 10 is only 33.00 seconds, which is even less than the 

running time of the DP phase of the DP-fMCMC method (39.58 seconds).) Note that (t(SAD) 
from the DP-fMCMC is shown to be very large. The DP-fMCMC has its (t(SAD) even larger than 
its /i(SAD) (the sample mean of SAD) when i > 8, which indicates that the performance of the 
DP-fMCMC is not stable based on the 500,000 MCMC iterations. Comparing with the AT-best, 
the IW-DDS with No = 50, 000 uses the much shorter running time but has its real mean of SAD 
significantly smaller than the SAD from the AT-best with AT = 162 since the p-value from the 
corresponding one-sample t test is less than 1 x 10“^^. Therefore, the learning performance of the 
IW-DDS is also significantly better than the performance of the other two methods for the data case 
Letter with m = 500. 


12 


The experimental results for the other three data cases Wine, Child with m = 500, and German 
are represented similarly in the supplementary material. The same conclusion about the learning 
performance can be clearly drawn by examining the figures shown in the supplementary material. 
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Letter (m = 500) 



Figure 11: Plot of the SAD Performance of 
the DP+MCMC, the AT-best and the IW-DDS 
for Letter (m = 500) 



Figure 12: Plot of the Total Running Time of 
the DP+MCMC, the AT-best and the IW-DDS 
for Letter (m = 500) 


4.3 Learning Performance of Non-modular Features 


In Sections |47T] and |4f2] we did not provide experimental results on the learning performance of non- 
modular features. We did not do so in Section [T2] because there is no known method to compute 
the true/exact posterior probability of any non-modular feature p^{f\D) except by the brute force 
enumeration over all the (super-exponential number of) DAGs so that the quality of the correspond¬ 
ing p^{f\D) learned from any approximate method cannot be precisely measured. We did not do 
so in Section [4d] because the current PO-MCMC tool (BEANDisco) only supports the estimation 
of the posterior of an edge feature so that the comparison of our method and the PO-MCMC can 
only be made for the edge feature. (Thus, we did not make the comparison for the path feature 


(which is one particular non-modular feature), though the DP algorithm of Parviainen and Koivisto 


(20111 can compute the exact posterior of a path feature p_^(/|A)).) Our idea is that by showing that 
our algorithms have significantly better performance in computing fundamental structural features 
(directed edge features), which should be due to the better quality of our DAG samples with respect 
to the corresponding p^{G\D) or p^{G\D), we expect that they will also be superior in computing 
other complicated structural features using the same set of DAG samples. 


To verify our expectation, we performed the experiments on the real data set “Iris” (with re = 5) 


from the UCI Machine Learning Repository (Asuncion and Newman 

2007 

1 and the well-studied 

data set “Coronary” (Coronary Heart Disease) (with re = 6) (|Edwards| 

2000 

I. Since re is small. 


by enumerating all the DAGs, we were able to compute p^{f\D), the true posterior probability 
for any interesting non-modular feature /. For the demonstration purpose, we investigated the 
following five interesting non-modular features, /i, a directed path feature from node x to node y, 
denoted by x ~> y, represents the situation that variable x eventually influences variable y. / 2 , a 
limited-length directed path feature x ~> y that has its path length no more than 2, represents that 
variable x can influence variable y via at most one intermediate variable, /a, a combined path feature 
X y z, can be interpreted as the situation that variable x eventually influences variable y 
which in turn eventually influences variable z. f^, a combined path feature y <~ x ~> 2; with 
y ^ z, means that variable x eventually influences both variable y and variable 2;. /s, a combined 
path feature y <~ x z with x ^ z, represents that variable x eventually influences variable 
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y but not variable 2 ;. Then we compared the SAD performance of the (directed) edge feature with 
the corresponding SAD performanceof each feature fj (j G {1, 2,3,4, 5}) from the DP+MCMC, 
the iT-best and the IW-DDS. The experimental results on both data sets show that if the SAD of 
the IW-DDS is significantly smaller than the SAD of the competing method (the DP-fMCMC or the 
iT-best) for the edge feature, then the SAD of the IW-DDS will also be significantly smaller than 
the SAD of the competing method for each investigated non-modular feature fj using the same set 
of DAG samples. Thus, our expectation is supported by the experiments. The detailed experimental 
results are as follows. 

The following is our experimental design for the data set Coronary with n = 6 and m = 1841. 
For the IW-DDS, we tried the sample size No = 2, 500 • i, where i G {1,2,..., 20}. For each i, we 
independently ran the IW-DDS 20 times to get the sample mean and the sample standard deviation 
of SAD for the (directed) edge feature and the five non-modular feafures (/i, / 2 , /a, and /s). 
For fhe DP-fMCMC, we ran 25,000 • i MCMC iferafions, where i G {1,2,..., 20}. For each i, we 
discarded fhe firsl 12, 500 • i MCMC iferafions for “burn-in” and sef fhe fhinning paramefer fo be 
5 so fhaf 2, 500 • i DAGs gof sampled. For each i, we independenfly ran fhe MCMC 20 limes fo 
gef fhe sample mean and fhe sample sfandard deviafion of SAD for fhe edge fealure, /i, / 2 , /a, 
and /s. For fhe iT-besl, we ran fhe iT-besl wifh K = 10 ■ i, where i G {1, 2,..., 20}. For each i, 
we ran fhe iT-besl jusl once fo gel SAD for fhe edge fealure, /i, / 2 , /a, /4 and /s since Ihere is no 
randomness in fhe oulcome of fhe iT-besl algorilhm. 

The experimenlal resulls for fhe dala sef Coronary are demonslrated from Figure [T^ fo Figure 
[T8| Figure [T^ shows fhe SAD performance of fhe Ihree melhods wifh each i for fhe edge fealure, 
where an error bar represenfs one sample sfandard deviafion across 20 runs for fhe DP-fMCMC or 
fhe IW-DDS al each i. Correspondingly, Figure fo Figure show fhe SAD performance of 
fhe Ihree melhods wifh each i for fhe five investigated non-modular feafures (/i, / 2 , /a, and /s) 
respectively. Combining Figure [T^ and each of Figures [T^ [T^ [T^ and [T^ one can clearly see 
that if the SAD of the IW-DDS is significantly smaller than the SAD of the competing method (the 
DP-fMCMC or the AT-best) for the edge feature, then the SAD of the IW-DDS will also be signifi¬ 
cantly smaller than the SAD of the competing method for each of the five invesligaled non-modular 
feafures. More specifically, comparing wifh fhe DP-fMCMC, al each i G {1,2,... 20}, for fhe edge 
fealure, fhe real mean of SAD from fhe IW-DDS is significanlly smaller lhan fhe corresponding 
one from fhe DP-fMCMC wifh fhe p-value < 0.01 from fhe Iwo-sample t fesl wifh unequal vari¬ 
ances. Consislenlly, al each i G {1, 2,... 20}, for each investigated non-modular fealure fj, fhe 
real mean of SAD from fhe IW-DDS is also significanlly smaller lhan fhe corresponding one from 
fhe DP-fMCMC wifh fhe p-value < 0.01 from fhe same t fesl. Comparing wifh fhe iF-besl, al each 
i G {1,2,... 20}, for fhe edge fealure, fhe real mean of SAD from fhe IW-DDS is significanlly 


2. More specifically, SAD = \P-^(x ~> y\D) — p^{x ~> y\D)\) for the path feature x ~> y, SAD = 

2/|f2) — P-j^{x ~> y\D)\) for the path feature x ~> y whose length is no more than 2; SAD 
= (flxyz y AD) ~ y -*112)1) for the combined feature x ~> y ~> 2 ; SAD = 

C^xyz y^z \P-A (y ® AD) — Pyi. ijj <~ X ~> z\D)\) for the combined feature y <~ x ~> 2 with y f z; 
SAD = \Py^ {y X o‘'> z\D) — p^{y <~ x no> 2 |D)|) for the combined feature y x n^> z with 

X z/z Z. 

3. The purpose of the experimental setting for the DP+MCMC and the 7G-best is merely to testify the claimed “if-then” 
conditional statement: if the SAD performance from the IW-DDS is significantly better than the SAD performance 
from the competing method for an edge feature, then the SAD performance from the IW-DDS will also be signifi¬ 
cantly better than the one from the competing method for each investigated non-modular feature using the same set 
of DAG samples. 
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Coronary 



Figure 13: SAD of the Learned Edge Features 
for Coronary 


Coronary 



Figure 15: SAD of the Feamed /2 Features 
for Coronary 


Coronary 



Figure 17: SAD of the Feamed /4 Features 
for Coronary 
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Figure 14: SAD of the Feamed /i Features 
for Coronary 
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Figure 16: SAD of the Feamed /a Features 
for Coronary 
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Figure 18: SAD of the Feamed /s Features 
for Coronary 
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Figure 19: SAD of the Learned Edge Features Figure 20: SAD of the Learned fi Leatures 
for Iris for Iris 



Ligure 21: SAD of the Learned /2 Leatures Ligure 22: SAD of the Learned /a Leatures 
for Iris for Iris 



Ligure 23: SAD of the Learned /4 Leatures Ligure 24: SAD of the Learned /s Leatures 
for Iris for Iris 
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smaller than the SAD from the K-best with the p-value < 1 x 10“^'^ from the one-sample t test. 
Consistently, at each i G {1, 2,... 20}, for each investigated non-modular feature fj, the real mean 
of SAD from the IW-DDS is also significantly smaller than the SAD from the if-best with the 
p-value < 1 X 10“^° from the same t test. 

We also performed the same kind of experiments for the data set Iris with n = 5 and m = 150. 
The results are demonstrated from Figure[T^to Figure]^ Comparing with the DP-i-MCMC, at each 
i G {1,2,... 20}, just as the comparison result for the edge feature, for each investigated fj, the 
real mean of SAD from the IW-DDS is also significantly smaller than the corresponding one from 
the DP-i-MCMC with the p-value < 1 x 10“^ from the two-sample t test with unequal variances. 
Comparing with the if-best, at each i G {1,2,... 20}, just as the comparison result for the edge 
feature, for each investigated fj, the real mean of SAD from the IW-DDS is also significantly 
smaller than the SAD from the if-best with the p-value < 1 x 10“® from the one-sample t test. 
Thus, a conclusion similar to the one from Coronary can be drawn: if the SAD of the IW-DDS is 
significantly smaller than the SAD of the competing method for the edge feature, then the SAD of 
the IW-DDS will also be significantly smaller than the SAD of the competing method for each of 
the five investigated non-modular features. 


4.4 Performance Guarantee for the DDS Algorithm 


To testify the quality guarantee for the estimator based on the DDS algorithm (Corollary (iv)), 
we performed experiments based on two data cases (Letter with m = 100 and Tic-Tac-Toe), which 
have relatively large fi{SAD) or /i(MAD) ( = /i(SAD) /(n(n — 1)) ) from the DDS algorithm 
shown in Table [T] Based on the hypothesis testing approach, we can conclude with very strong 
evidence that the performance guarantee for our estimator holds for both data cases. The details of 
the experiments are as follows. 


For the first set of experiments, we choose the data case Letter with m = 100, which has the 
largest /i(SAD) (= 0.2948) from the DDS among all the 33 data cases shown in Table [T] We first 
consider the setting of the parameters specified as e = 0.02 and S = 0.05, which serves as our 
performance requiremenf. By seffing fhe DAG sample size No = [(ln(2/(5))/(2e^)] = 4612, we 
infend fo show fhaf fhe estimator p^(flD) coming from our DDS has fhe performance guaranfee 
such fhaf fhe Hoeffding inequalify p(|p-<(/|D) — p^{f\D)\ > e) < 6 holds. Each directed edge 
fealure / is investigated here since fhe posterior of each edge p^{f\D) can be easily obfained by 
using fhe DP algorifhm of Koivisfo (2006|. For each edge /, we call fhe event of \p^{f\D) — 
p_< {f\D)\ > e as the event of violation (of the pre-specified estimation error bound) in the learning 
of /. Define fhe indication variable W for fhe evenf of violation in fhe learning of /. Thus, VF is a 
Bernoulli random variable wifh fhe success probabilify p^o = p{\p^{f\D) — p^{f\D)\ > e). We 
independenfly repeaf fhe DDS algorifhm (wifh fhe same No) R = 400 times and use fhe average of 
W as fhe esfimafor p^io for each edge. Note fhaf fhe mean of p^o is p^io and fhe variance of p^io 
is Poto(1 — Pvio)/R because Rpvio has a binomial disfribufion wifh fhe frial number R and success 
probabilify p^io- Since we expecf fhaf p^io will be small so fhaf p^,^o(l — Pvio) will be large relafive 
to Pvio, we choose large R = 400 fo make fhe variance of pvio relafively small wifh respecf fo fhe 
mean of pvio- Figure [25] shows fhe histogram of pvio for each of all fhe n(n — 1) = 272 directed 
edges. For each of fhe 272 edges, if can be clearly seen fhaf fhe corresponding pvio is much smaller 
fhan 6 = 0.05 marked by fhe verfical bar. For 240 ouf of fhe 272 edges, fhe corresponding pvio’^ 
are exacfly equal fo 0. Even for fhe largesf pvio = 0.015, corresponding fo 6 successes among 400 
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trials, we can use the one-sided hypothesis testing to reject the null hypothesis that p^io > 0.05 
and to conclude that p^io < 0.05 with the p-value less than 2 x 10“^. Therefore, the Hoeffding 
inequality holds for the learning of each edge in this parameter setting. 

Next, we consider another setting of the parameters with e = 0.01 and S = 0.02, which has more 
demanding performance requirement. By setting the DAG sample size No = [(ln(2/(i))/(2e^)] 
= 23026, we want to show that the estimator p^ {f\D) coming from our DDS has the performance 
guarantee satisfying the Hoeffding inequality. With the same logic, we independently repeat the 
DDS algorithm (with the same No) R = 1250 times and use the average of W as the estimator 
pyio for each edge. (Here we choose even larger R since we expect that pyio will be smaller in this 
parameter setting.) Figure shows the histogram of pyio for each of all the 272 directed edges. 
For each edge, it can be clearly seen that the corresponding pyio is much smaller than 6 = 0.02. 
Even for the largest pyio = 0.004, corresponding to 5 successes among 1250 trials, we can use the 
one-sided hypothesis testing to reject the null hypothesis that pyio > 0.02 and to conclude that pyio 
< 0.02 with the p-value less than 2 x 10“®. Therefore, the Hoeffding inequality also holds in this 
parameter setting. 

For the second set of experiments, we choose the data case Tic-Tac-Toe, which has the largest 
/l(MAD) (= /x(SAD) /(n(n — 1)) = 0.1547/90) from the DDS among all the 33 data cases shown 
in Table [T] The same kind of experiments are performed for this data case. For the parameter 
setting with e = 0.02 and S = 0.05, the corresponding result is shown in Figure]^ For each of 
the 90 edges, the corresponding pyio is clearly much smaller than S = 0.05. Even for the largest 
pvio = 0.0125, corresponding to 5 successes among 400 trials, we can use the one-sided hypothesis 
testing to conclude that pyio < 0.05 with the p-value less than 6 x 10“^. Eor the parameter setting 
with e = 0.01 and 5 = 0.02, the corresponding result is shown in Eigure Eor each edge, the 
corresponding pyio can be clearly seen to be much smaller than <5 = 0.02. Even for the largest pyio 
= 0.0056, corresponding to 7 successes among 1250 trials, we can use the one-sided hypothesis 
testing to conclude that pyio < 0.02 with the p-value less than 3 x 10“®. Thus, the Hoeffding 
inequality also holds in the set of experiments for this data case. 

Einally, for the data case Tic-Tac-Toe, we fix e = 0.02 but increase No from 2,000 to 10,000 
by an increment of 1, 000 each time. Eor each No, we plot the Hoeffding bound for the 

probability of violation = p(|p_<(/|Zl)—p_<(/|Zl)| > e) in Eigure [2^ (Note that the Hoeffding 
bound decreases at an exponential rate as No increases.) Then for each No, we also plot both the 
maximum and the mean of all the n{n — 1) pyio's in Eigure]^ Again pyio for each edge is the 
average of W by independently running the DDS algorithm (with the same No) R times. We set 
R = max{A00, [10/(e“^^°^ )]} and use the larger R for the larger No since we expect that pyio 
will be smaller for the larger No- Erom Eigurewe can clearly see that the maximum of all 
the n{n — 1) pyio’s, is always far below the Hoeffding bound for each No- Eurthermore, for each 
No we can use the one-sided hypothesis testing to reject the null hypothesis that pyio > 
and to conclude that pyio < with the p-value less than 3 x 10“'^. Therefore, the Hoeffding 

inequality holds for each No- 

5. Conclusion 

We develop new algorithms for efficiently sampling Bayesian network structures (DAGs). The 
sampled DAGs can then be used to build estimators for the posteriors of any features of interests. 
Theoretically we show that our estimators have several desirable properties. Eor example, unlike 
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Letter (e = 0.02, 5 = 0.05, N =4612) 


Letter(e = 0.01,5 = 0.02, N =23026) 
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Estimated Probability of Violation 

Figure 25: Histogram of Estimated Probabil¬ 
ities of Violation in Edge Eeaming for Eetter 
(m = 100) with e = 0.02, 6 = 0.05 and 
No = 4612 
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Estimated Probability of Violation 

Eigure 26: Histogram of Estimated Probabil¬ 
ities of Violation in Edge Eeaming for Eetter 
(m = 100) with e = 0.01, 6 = 0.02 and 
No = 23026 


Tic-Tac-Toe t = 0.02, 8 = 0.05, = 4612) 



Eigure 27: Histogram of Estimated Probabil¬ 
ities of Violation in Edge Eeaming for Tic- 
Tac-Toe with e = 0.02, 6 = 0.05 and No = 
4612 



Estimated Probability of Violation 


Eigure 28: Histogram of Estimated Probabil¬ 
ities of Violation in Edge Eeaming for Tic- 
Tac-Toe with e = 0.01, 6 = 0.02 and No = 
23026 
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Tic-Tac-Toe f = 0.02) 



Sample Size 


Tic-Tac-Toe ^ = 0.02) 



Sample Size 


Figure 29: Plot of Probability of Violation versus No for Tic-Tac-Toe with e = 0.02 


the existing MCMC algorithms, the estimators based on the DDS algorithm satisfy the Hoeffding 
bound and therefore enjoy the quality guarantee of the estimation with the given number of samples. 
Empirically we show that our estimators considerably outperform the previous state-of-the-art with 
or without assuming the order-modular prior. 

Our algorithms are capable of estimating the posteriors of arbitrary (non-modular) features (the 
DDS under the order-modular prior, the IW-DDS under the structure-modular prior); while the exact 
algorithms are available for computing mod ular features under the order-modular prior with time 
0{n^~^^C{m) + A:n2") and space 0(n2”) (Koivisto and Sood 2004 Koivisto 2006); comput¬ 


ing path features under the order-modular prior with time + n3”) and space 0(3 

(Parviainen and Koivisto[|20TT l; and computing modular fe atures under the stru cture-modular prior 


with time 0(n*^"'“^0(m) -|- kn2^ -|- 3”) and space 0(n2”) (Tian and He 


20091. The bottleneck of 


our algorithms is their first computation step, the DP algorithm of Koivisto and Sood (20041 whose 
space cost is 0(n2”). Therefore the application of our algorithms is limited to the data sets on 
which the DP algorithm of Koivisto and Sood (20041 is able to run - up to around 25 variables in 
current desktops, while a parallel implementation of the DP algorithm has been demonstrated on a 
data set with 33 variables using a cluster including totally 2,048 processors and 8,192 GB memory 
(IChen et al.l 120141). 
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Appendix A. Proofs of Propositions, Theorems and Corollary 

This appendix provides the proofs of the propositions, theorems and corollary in the paper. 

A.l Proof of Proposition!^ 

We first prove a lemma for Proposition 1. 

Let an order ^ be represented as (ui,..., an) where ai is the fth element in the order. 

Lemma 8 The probability that the last n — k + 1 elements along the order are a^, a^+i,. ■ ■ ,an 
respectively is given by 

n 

p{ak,ak+i,...,an,D) = L' 

i=k 

where U^.=V- {ai, ai+i,an}- 

Proof 

^fc+1; • • • D) 

^ ^ i^kt ^k-\-li • • • 5 

n 

E n (from®) 

i=k 

n 

= (from®) 

i=k 


Proposition[T]can be directly proved from the conclusion of Lemma[^according to the definition 
of conditional probability. 


A.2 Proof of Proposition]^ 

Proof 

From Proposition!^ and L'{U^^ = 0) = 1, we have 

n 

WpiakWk+i, ...,an,D) 


k=l 


L'iV) 
p{<,D) 


pAd) 

= P{<\D), 


(from ( [T3] ) & ( fT4| ) 


which proves Eq. (20 1 . 
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A.3 Proof of Theorem |3] 


Proof 

First, we show that each DAG G sampled according to our DDS algorithm has its pmf Ps{G) 
equal to the exact posterior {G\D) by assuming the order-modular prior. 

On one hand, from the following derivation, we can get the exact form for p_< {G\D)-. 


PAf\D) 

= Y^p{^\D)p{f\ ^,D) 

= Y,Pi^\D)Y.f(G)p{G\ <,D) 

-< gca 

= EE f{G)p{<,G\D) 

-< gca 

= E (28) 

G ^s.t.GC^ 


Thus, for each possible DAG Gi, by setting /(G) to be the indication function I[G = Gj] and then 
relating Eq. (281 with Eq. (|4|l, we know that p^{Gi\D) = ^ GiC^Pi~^^ Gi\D) for each G*. 


On the other hand, the event that a DAG G gets sampled according to our DDS algorithm occurs 
if and only if one of the orders that G is consistent with gets sampled in Step 2 of the DDS algorithm 
and then G gets sampled from the sampled order in Step 3 of the DDS algorithm. Therefore, based 
on Total Probability Eormula, Ps(G) = t Gc-< p{^ \D)p{G\ ^,D) = E^s.t.Gc^Pi^,G\D) 
= pAG\d). 

Second, since No orders are sampled independently and each DAG per sampled order is sam¬ 
pled independently, p^(Gi, G 2 ,..., GatJD) = n*EE-<stGc-<p(^ \D)p{Gi\ = nE 

p^Gm. 

Therefore, Theorem]^ is proved. 


A.4 Proof of Corollary]^ 

Proof 

Eor each Gi in the DAG set {Gi, G 2 ,..., Gno} sampled from the DDS algorithm, /(Gj) G 
{0,1}. Since Gi, G 2 ,... ,Gno are iid with pmf p^{G\D) (by Theorem]^, /(Gi), /(G 2 ),..., 
/(Gno) are iid with Bernoulli pmf. 

Eor each Gi, the following is true for E{f{Gi)), the expectation of /(Gj). 

Eim)) 

= Y.f{G)p^{G\D) 

G 

= pM\d). 

Thus, /(Gi), /(G 2 ),... ,f{GNo) are iid with Bernoulli(p_<(/|D)). In other words, /(Gi), 
/(G 2 ),..., /{Gno) are independent; and for each Gi, P{f{Gi) = 1) = p^{f\D) and P{f{Gi) = 
0) = l-p^(/|D). 
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(i) Proof that p^{f\D) is an unbiased estimator for p^{f\D), that is, E{p^{f\D)) = p^{f\D). 


e{pM\d)) 


= E 



No 


, No 

■‘■^o ■ 1 
1=1 

= ^^NopM\d) 
= pAf\D). 


(ii) Proof that p^ {f\D) converges almost surely to p_< {f\D). 

Since f{Gi), /(G 2 ),.. • ,/(GArJ are iid with E{f{Gi)) = p^{f\D) < 00 , and since 


, No 

i=l 


based on the Strong Law of Large Numbers (Theorem 5.5.9) (Casella and Berger 2002), p^ {f\D) 
converges almost surely to p^ {f\D) (as No —)• 00 ). 

Note that, by Theorem 2.5.1 ( Athreya and Lahin) 20061, the property that p^{f\D) converges 
almost surely to p^{f\D) implies that p^{f\D) converges in probability to p^{f\D), that is, 
p^{f\D) is a consistent estimator for p^{f\D). 

(iii) Proof that if 0 < p^{f\D) < 1, then the random variable 

VN-o{pM\D)-pMm 

y/p^mi-PAm) 

has a limiting standard normal distribution, denoted by 


VNo{pM\D)-pMm U 

^pMmi-pMm 

Since f{Gi), /(G 2 ),..., /{Gno) Hd with Bernoulli(p_<(/|i9)), for each Gi, the following 
is true for Var{f{Gi)), the variance of f{Gi). 


Var{f{G,)) 

= E{f{G,m-E{f{G,))) 

=pM\D)a-pAfm 

< 00 . 


Since 0 < p^{f\D) < 1, Var{f{Gi)) is also strictly greater than 0. 

Again, we have already known that E{f{Gi)) = p_< {f\D) < 00 . 

Thus, by the Central Limit Theorem (Theorem 5.5.15) ( [Casella and Berger||2002 1, 
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VN-o{p^{f\D) - EifjG))) 
y'VarifiG)) 

that is, 


VKjpMlD) - pMm 

^pM\D){-^-pMm 


Since p^{f\D) converges in probability to p^{f\D), denoted by p^{f\D) 
the Continuous Mapping Theorem (Theorem 5.5.4) ( [Casella and Bergerj 2002| , 

^p^{f\D){l-pM\D))^^ ^/pM\m'^-pM\D))- 


Finally, by Slutsky’s Theorem (Theorem 5.5.17) ( [Casella and Berger}|2002[ ), 


P-<(/|T»), by 


VN-o{p^{f\D)-p^{f\D)) 

^pM\D){-^-pMm 

(iv) Proof that for any e > 0, any 0 < <5 < 1, if A^o > (lii(2/5))/(2e^), then p{\p^{f\D) — 


PA(/l^)l<e)>l-5- 

Since f{Gi), /(G' 2 ),..., /(GaTo) are iid with Bemoulli(p_<(/|i7)), the Hoeffding bound (Ho- 
effdingt[l963[|Koller and Friedman||2009| l holds: 


Pi\P-<if\D) - p^if\D)\ >e)<2e 


This is equivalent to 


Pi\P-<if\D) - p^{f\D)\ < e) > 1 - 2e 

and the conclusion is implied straightforward. 


A.5 Proof of Equation ( |2T] | 

Proof For any j G {1,..., n}: 

P{{a„U,^)\D) 
oc P{{aj,U„.),D) 

= ^ ^ ^ ^ ^*((71,..., (Tj — l, (Tj, crj_|_i, ..., (Tn,, 79) 

(tJl _i) .,< 7 ^ 1 ) 

£C{Uaj) ^C(V-Uaj-Wj}) 

n 

= E E 

((Tl,...,crj_l) ((Tj + 1,...,(T„) j —1 

= a',^{U,.)L'{U^^)E!iy - - {a,}). 
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A.6 Proof of Equation ( |22| l 
Proof 


P<{G\D) 

= Pi^^G\D) 

-<s.t.GC^ 


P^{D) 


PAD) 

1 

PAD) 

1 

pAD) 

1 

pAd) 


pAd) 

pAd) 

pAd) 


-<s.t.GC^ 


Y, pA,G)p{D\G) 


-<s.t.GC-< 


E (H qiiUi)pi{Pai))p{D\G) 


As.t.GCA i=l 
n 


E ([lP^iPai))piD\G) 


^s.t.GC^ i=l 


E P{G)p{D\G) 


^s.t.GC~( 


■<G I • pAG-, d) 
pg I -pAgId). 


Since both pAD) > 0 and p^{D) > 0, p^{G\D) oc | | • p/(G|D), which also has been 

shown by [Ellis and Wong (20081. 


A.7 Proof of Theorem 


Proof 

Let n denote the set of all the DAGs. Define = {G G D : p^{G,D) > 0}. U'^ will 
equal D if the user chooses a prior such that p{G) > 0 for every DAG G (such as a uniform DAG 
prior p{G) = 1). However, if the user has some additional domain knowledge so that he/she sets 
some prior to exclude some DAGs a priori, will be a proper subset of D. Note that p^{f\D) = 
pAf^D)/p^{D), where p^(/,E») = Y.g fiG)pAG, D) = Y1 g£U+ fiG)pAG, D), and p^{D) 

= pAf ^l,D) = EgPAG,D) = EGeu+PAG,D). 


Let /[] denote the indicator function. Rewrite p^{f\D) in Eq. (24i as p/{f, D)/p^{D), where 


PAf^D) = EGegfiG)pAG,D) = EgAG) pAG,D)I[G eg] = EGemfiG)pAG,D)I[ 
G G g], and PAD) = PAf ^1,D) = EGegPAG,D) = EgPAG,D)I[G € g] = Eg^u+ 
pAG,D)I[G€g]. 

Note that by Theorem]^ P{G G ^) = 1 — (1 —p^{G\D))^°, where p^{G\D) is the exact poste¬ 
rior of G under the order-modular prior assumption. Also note that by Eq. (22 1 , VG : {p^{G, D) > 
d)^{pAG\D)>d). 


42 


















Structure Learning in BNs of Moderate Size by Efeicient Sampling 


(i) Proof that p^{f\D) is an asymptotically unbiased estimator for p^{f\D), that is, 

lim Eip4f\D))=p4f\D). (29) 

No^oo 

For notational convenience, let 7 denote p^{f,D), and let r denote p^{D). Define g{'y,T) 
= so that g{'y,T) is essentially p^{f\D). 

Note that 

^( 7 ) 

= E{p4f,D)) 

= e [ f{G)p4G,D)I[Geg]] 

\Geu+ / 

= E{f{G)p4G,D)I[Geg]) 

G&U+ 

= E f{G)pAG,D)P{Geg) 

G&A+ 

= E f{G)p^{G,D){l-{l-p^{G\D)f‘>). 

G&A+ 

Thus, 


lim ^( 7 ) 

No^oo 


No^oo I ^^ 


vGew+ 


E f{G)p^(G, D) hm (l-(l-p^(G|D))^°) 

^ ^ No^oo 

G(PJ+ 


= E f{G)p4G,D) 

G&A+ 

= pAf’E>)- 


Similarly, by setting / = 1, we have 
E{t) 

= E{p4D)) 

= E pAG,D){l-{l-p^{G\D)f‘>), 

GGU+ 


and 

lim E{t)=pAD). 

No^oo 
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Next, by Taylor’s Theorem (with the Lagrange form of the remainder), 


= g{E{^),E{T)) + 


dg{'y,T) 

d'j 


+ 


dr 


J 'y=E{'y),T=E{T) 


-I ■y=E{'y)^T=E{T) 

{t*-E{t)), 




where 7 * = E{'^) + 0(7 — £'( 7 )), t* = E{t) + 6{t — E{t)), and 6 is a random variable such 
that 0 < 0 < 1 . 

Examine the components separately: 


g{E{^),E{T)) = E{^)/E{T). 


dg{i,T) 

J '^=E('y) ,t=E(t) 
] 7=ii/(7),r=ii/(r) 

= {E{r))-\ 


dg{i,T) 

9 t j ^=E(-f),T=E(T) 

= [-liT) ]^=E('y),T=E{T) 

= -E{^){E{t))-\ 


Since neither -^( 7 ) nor E{t) is random, 

E{g{i,T)) 

= E{-i)/E{T) + {E{t))-^E{^* - E{-f)) - E{^){E{t))-^E{t* - E{t)) 

= E{j)/E{t) + {E{T))-^E{e{j - E{^))) - E{^){E{r))-^E{6{T - E{t))). 


Consider the limit of each component separately: 

lim {E{-f)/E{T)) 

No—>-OD 

= f lim ^( 7 )) / ( lim E{t)] 

\No^oo J \No^oo J 

= pAf’E)/pAE) 

= PAf\D)- 
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lim {E{t)) ^ 

Nq—^oo 


= ( lim E(t)] 

\No^oo J 


-1 


lim -E{-f){E{T)) 2 

No^oo 

= — f lim f lim 

Note that both {p^{D))~^ and D){p^{D))~‘^ are constant real numbers. 

Finally, we intend to prove the following two equalities: 

lim E{e{-f - E{-^))) = 0, (30) 

No^oo 

lim E{e{T-E{T))) = Q. (31) 

No^oo 

Once this is done, 

lim E{g{-f,T)) 

No^CSD 

= li™ + lim iE{T))-^ ■ 0 + lim {-E{-f){E{T))-^) ■ 0 

No^oo No^oo No^oo ' 

= PAf\D)- 

The whole proof for Eq. ( |2^ will then be done. 

The proof for Eq. ( [^ is as follows. 

Based on the definition of limit, proving Eq. (|30|) is equivalent to proving 


lim |E;( 6 »( 7 -.E( 7 )))| = 0 . 

No—^oo 


(32) 


Since 


\E{e{j - E{^)))\ 

<i^(|0(7-S(7))l) 

= E;(|0|-|7-i?(7)l) 

< (|7 — F^( 7 )|) (due to 0 < 0 < 1 ), 


and|i?(0(7-S(7)))| > 


U, LU piUVC E/q. 




lim £’(| 7 -E;( 7 )|) = 0 . 

No^oo 


(33) 
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= E 


= E 


Y, f{G)p4G,D)I[Geg] - Y f{G)p4G,D)P{Geg) 

G&A+ G&A+ 


Y f[G)p^{G,D){I[G & g] - P{G & g)) 

G&A+ 


< ^ f{G)p^iG, D) \I[G G a] - P{G G g)\ 


.G&A+ 


Y f{G)p^{G,D)E{\I[G^g]-P{G^g)\) 


G&A+ 


= Y f{G)p^{G,D)[{l-P{Geg))P[Geg) 

G&A+ 

+ (P(GGg)-0)(l-P(GGg))] 

= E f{G)pAG,mp{G^g){i-p{G^g))\ 

GGU+ 

= Y f{G)p^{G,D)[2{l-{l-p^{G\D)fo)x{l-p^{G\D))^<^]. 

G&A+ 

Since VG G^/+ : p^{G\D) > 0, 

lim E /(G)p^(G,P)[2(1-(1-p^(G|P))^°)x(1-p^(G|P))E 


No^oo 


G&U+ 


= Y f{G)pAG,D) Ihn [2(1-(1-p^(G|P))^°)x(1-p^(G|P))E 

^^ A^o —>■00 

Gew+ 

= E fiG)p4G,D)-0 

G&A+ 

= 0 , 

Eq. ( |3^ is proved, and the proof for Eq. ( |30| ) is done. 

Setting / = 1 in Eq. ( [30| ) leads to Eq. ( [^ . 


Thus, the whole proof for Eq. (291 is done. 

(ii) Proof that p^{f\D) converges almost surely to p^{f\D), denoted by p^{f\D) — 

pAf\D)- 

Remember that Q. denotes the set of all the DAGs, that is, Q. = {Gi, G 2 ,..., Gw*}, where W* 
is |n|, the number of all the DAGs. Note that W* is a finite positive integer though it is super¬ 
exponential in the number of variables re. 


Eet P be V{Q), the power set of D. Thus, P is a cr—algebra on D (Athreya and Eahiri 20061. 
Define for any A G F,p{A) = Yhc &AP<{Gj\D). If is well-known fhaf/r is a probabilify measure 


on F so fhaf (fl, F, p) is a probabilify space (Afhreya and Eahiri 20061. 
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For each i > 1, let Qi = Q, = T and = fi. Let G^‘^\ ...) : G flj, i > 

1}. Let a cylinder set 4 = x ^2 x • • • x x ^fc+i x f^fc +2 x ..., where there exists I < k < 00 

such that Ai £ Ti for 1 < i < k and Ai = Qi for i > k. Let = a < {A : 4 is a cylinder set 
} >, that is, is a cr—algebra generated by the set of all the 4 ’s- is a a it— algebra on 

and is called a product (t— algebra. 

Define, for each (^ 1 , ^ 2 , • ■ ■, ^^fc+ 2 , ■ • •) G ^ 2 , • • •, ^fc+l) ^fc+ 2 ) 

...) = fj,i{Ai) X ^ 2 (^ 2 ) X ... X Hk{Ai^). By Kolmogorov’s consistency theorem, , n°°) 

is a probability space ( [Athreya and Lahin||2006| l. 

Let = {(G(i),G(2),...) g : 3G E ZY+ : Vf > 1 : G® 4 G}. Let O'*’! = 
5^00 _ j^oo,o_ -Thus, = {(Gi^i, G(2\ ...) G : VG G : 3i > 1 : G« = G}. 

Define, for each G D°°, 

_ E«=»^-/(G)P/(G.P)J[Ggg~°(^°°)] 

’* Ecffl*P,<(G,0)/[Ge5«.(a,~)| ’ 

where Q^°{u°°) is fhe DAG sef including fhe firsf No coordinafes of u}°°. By fhe definifion, we 
know fhaf (c(;°°) = p^{f\D). 

For each uj°° G 0°°’^, for each G G U^, lef N{G,IjJ°°) be fhe smallesf infeger such fhaf 
q{N{g, uj"^)) _ Q ]^g|- = maxQ^K+N{G,oj°°). Then for each No > N{uj°°), for each 

GG^f+,/[GGg^°(a;“)] = L 

Accordingly, for each uj°° G fhere exisfs N{uj°°) such thaf for each No > N{uj°°): 

pfG^) 

^ EG 6 »t/(G)pAG.P) 

'^Geu+ 

= pAf\D)- 

This implies fhaf limAr^_>.ooP^°(w°°) = p^{f\D) for each G 
Finally, we infend fo prove fhe following equalify: 

^ 00 (J^ 00 , 1 ) ^ ^ (. 34 ) 

Once fhis is done, fhe whole proof for p^(/|D) — p^{f\D) is done. 

Proving Eq. ( [34| ) is equivalenf fo proving 

^oo(^oo,0) ^ 0. (35) 

For any G G D, lef = {(G^i), G^^),...) g : Vi > 1 : G® 4 G}. Thus, 0“’° 

= UGew+ Accordingly, < EGeW+ 

For each j > 1, lef = {(G^^), G^^),...) g : Vi G j} : G® 4 G}. 

Nofe 3 j]oo,o,G ,2 D ,.. D j^oo.o.Gj-i 3 ^oofi,G,3 for each j > 1. Thus, = 

n“ 1 
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Finally, for any G G U~^, 


KG,j] 


i=i 

= lim 
j^oo 

= lim Hi{n - {G}) X - {G}) X ... X - {G}) 

j^oo 

= lim [1 - ^({G})]^ 

J-^OO 

= lim [1 — p_<(G|Z))]-^ 

j^oo 

= 0 . 


Thus, Y2GeU+ = 0, so that Eq. (351 is proved. The whole proof is done. 


Note that, by Theorem 2.5.1 (Athreya and Lahiri 


2006), the property thatp^{f\D) converges al¬ 


most surely to p^{f\D) implies thatp^(/|i2) converges in probability to p^{f\D), that is, p^{f\D) 
is a consistent estimator for p^{f\D). 

(iii) Proof that the convergence rate of p^{f\D) is o{a^°) for any 0 < o < 1. 

In the proof for (ii), we have shown that for each io°° G there exists N{uj°°) such 

that for each No > N{uj°°), p^°{io°°) = p^{f\D). This means that for any 0 < o < 1, 

{a^°)-^[p^°{uj°°) -p^{f\D)] = 0. Thus, limjv„^oo(a^°)“^[p^°(w“) - p^{f\D)] = 0 so that 
the proof is done. 

(iv) Proof that if the quantity A = ^ — Py^ifl^) ^ ^ ' 

p4f\D) + l-A. 


The proof is essentially the same as the proof for Proposition 1 of |Tian et aL] ( |2010[ ) which 
proves Eq. ( pT] ), an equivalent form of Eq. ( [26] ). The direct proof for Eq. (26 1 is also provided in the 
supplementary material. 
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